1 Managing Grid and Web Services and their exchanged messages
OGF19 Workshop on Reliability and Robustness, Friday Center, Chapel Hill, NC, January 31, 2007
Authors: Harshawardhan Gadgil (his PhD topic), Geoffrey Fox, Shrideep Pallickara, Marlon Pierce (Community Grids Lab, Indiana University)
Presented by Geoffrey Fox, gcf@indiana.edu

2 Management Problem I
Characteristics of today's (Grid) applications:
– Increasing complexity
– Components widely dispersed and disparate in nature and access
   • Span different administrative domains
   • Under differing network / security policies
   • Limited access to resources due to presence of firewalls, NATs etc. (major focus in prototype)
– Dynamic: components (nodes, network, processes) may fail
Services must meet:
– General QoS and life-cycle features
– (User-defined) application-specific criteria
Need to "manage" services to provide these capabilities:
– Dynamic monitoring and recovery
– Static configuration and composition of systems from subsystems

3 Management Problem II
Management Operations* include:
– Configuration and lifecycle operations (CREATE, DELETE)
– Handling RUNTIME events
– Monitoring status and performance
– Maintaining system state (according to user-defined criteria)
Protocols like WS-Management / WSDM define inter-service negotiation and how to transfer metadata.
We are designing/prototyping a system that will manage a general worldwide collection of services and their network links:
– Need to address fault tolerance, scalability, performance, interoperability, generality, usability
We are starting with our messaging infrastructure because:
– we need it to be robust in the Grids we are using it in (sensor and material science),
– we are using it in the management system itself,
– and it has critical network requirements.
* From WS-Distributed Management, http://devresource.hp.com/drc/slide_presentations/wsdm/index.jsp

4 Core Features of Management Architecture
Remote management:
– Allow management irrespective of the location of the resource (as long as that resource is reachable by some means)
Traverse firewalls and NATs:
– Firewalls complicate management by disabling access to some transports and to internal resources
– Utilize tunneling capabilities and multi-protocol support of the messaging infrastructure
Extensible:
– Management capabilities evolve with time; we use a service-oriented architecture to provide extensibility and interoperability
Scalable:
– The management architecture should scale as the number of managees increases
Fault-tolerant:
– Management itself must be fault-tolerant: failure of transports or management components should not cause the management architecture to fail

5 Management Architecture built in terms of:
Hierarchical Bootstrap System (robust itself by replication):
– Managees in different domains can be managed with separate policies for each domain
– Periodically spawns a System Health Check that ensures components are up and running
Registry for metadata (distributed database), robust by standard database techniques and by our system itself for the Service Interfaces:
– Stores managee-specific information (user-defined configuration / policies, external state required to properly manage a managee)
– Generates a unique ID per instance of registered component

6 Architecture: Scalability via Hierarchical Distribution
Example hierarchy (diagram): a replicated ROOT with child domains US and EUROPE; EUROPE in turn contains FSU, CARDIFF and CGL, giving domain paths such as /ROOT/EUROPE/CARDIFF.
Active Bootstrap Nodes:
– Responsible for maintaining a working set of management components in the domain
– Always the leaf nodes in the hierarchy
Passive Bootstrap Nodes:
– Only ensure that all child bootstrap nodes are always up and running
– Spawn them if not present and ensure they stay up and running
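The active (leaf) versus passive (interior) distinction in the bootstrap hierarchy can be sketched minimally as below. All names here (BootstrapNode, path, is_active) are illustrative assumptions for this sketch, not identifiers from the actual system.

```python
# Minimal sketch of the bootstrap hierarchy: leaves are active (they
# maintain the working set of management components for their domain),
# interior nodes are passive (they only keep their children alive).

class BootstrapNode:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    @property
    def path(self):
        # Domain paths look like /ROOT/EUROPE/CARDIFF
        prefix = "" if self.parent is None else self.parent.path
        return prefix + "/" + self.name

    def is_active(self):
        # Active bootstrap nodes are always the leaves of the hierarchy
        return not self.children


root = BootstrapNode("ROOT")
europe = BootstrapNode("EUROPE", root)
cardiff = BootstrapNode("CARDIFF", europe)
```

A passive node's periodic health check would walk `children`, re-spawning any child found dead; that loop is omitted here.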

7 Management Architecture built in terms of (continued):
Messaging Nodes form a scalable messaging substrate:
– Message delivery between managers and managees
– Provides transport-protocol-independent messaging between distributed entities
– Can provide secure delivery of messages
Managers: active stateless agents that manage managees
– Both general and managee-specific management threads perform the actual management
– Multi-threaded to improve scalability with many managees
Managees: what you are managing (the managee / service to manage); our system makes these robust
– There is NO assumption that the managed system uses Messaging Nodes
– Wrapped by a Service Adapter which provides a Web Service interface
– Assumed that ONLY modest state needs to be stored/restored externally; a managee could itself front-end a huge database and restore itself from it
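The Service Adapter idea, wrapping an arbitrary managee behind a uniform management interface while keeping the externally stored state modest, can be sketched as below. All names (ServiceAdapter, export_state, import_state, DummyManagee) are hypothetical; the real system exposes the equivalent operations as a WS-Management Web Service front-end.

```python
# Hypothetical sketch of a Service Adapter. Only modest state crosses the
# adapter boundary to the registry; a managee could itself front-end a huge
# database and restore its bulk state on its own.

class DummyManagee:
    """Stand-in for any managed component with modest external state."""
    def __init__(self):
        self.config = {}

    def export_state(self):
        return dict(self.config)

    def import_state(self, state):
        self.config = dict(state)


class ServiceAdapter:
    """Uniform management interface over a wrapped managee."""
    def __init__(self, managee):
        self.managee = managee

    def get_state(self):
        # Called by a manager to persist managee state to the registry
        return self.managee.export_state()

    def set_state(self, state):
        # Called during recovery to bring a respawned managee up to speed
        self.managee.import_state(state)


adapter = ServiceAdapter(DummyManagee())
adapter.set_state({"port": 5045})
```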

8 Architecture: Conceptual Idea (Internals)
– Components connect to a Messaging Node for sending and receiving messages
– The user writes the system configuration to the registry
– Manager processes periodically check for available managees to manage, and read/write managee-specific external state from/to the registry
– Bootstrap nodes always ensure the management components are up and running, periodically spawning them; interactions use WS-Management

9 Architecture: User Component
– "Managee characteristics" are determined by the user
– Events generated by the managees are handled by the manager
– Event processing is determined via WS-Policy constructs, e.g. wait for the user's decision on handling specific conditions
– "Auto-instantiate" a failed service, but the service is responsible for doing this consistently, even when the failed service is not actually failed but merely unreachable
– Administrators can set up services (managees) by defining their characteristics
– Writing information to the registry can be used to start up a set of services

10 Issues in the Distributed System: Consistency
Examples of inconsistent behavior:
– Two or more managers managing the same managee
– Old messages / requests arriving after new requests
– Multiple copies of managees existing at the same time / orphan managees, leading to inconsistent system state
Use a registry-generated, monotonically increasing unique Instance ID (IID) to distinguish between new and old instances of managers, managees and messages:
– Requests from manager thread A are considered obsolete IF IID(A) < IID(B)
– The Service Adapter stores the last known MessageID (IID:seqNo), allowing it to differentiate between duplicates AND obsolete messages
– Periodic renewal with the registry: IF IID(manageeInstance_1) < IID(manageeInstance_2) THEN manageeInstance_1 is deemed OBSOLETE, SO EXECUTE the policy (e.g. instruct manageeInstance_1 to silently shut down)
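The duplicate/obsolete check on MessageIDs can be sketched as follows. The "IID:seqNo" format follows the slide; the filter class and its method names are hypothetical additions for illustration.

```python
# Minimal sketch of the Service Adapter's MessageID check: a message is
# rejected if it comes from an older manager instance (smaller IID) or
# repeats / precedes the last sequence number within the same instance.

def parse_message_id(message_id):
    iid, seq = message_id.split(":")
    return int(iid), int(seq)


class MessageFilter:
    """Tracks the last accepted MessageID (IID:seqNo)."""
    def __init__(self):
        self.last = None  # (iid, seq) of the last accepted message

    def accept(self, message_id):
        iid, seq = parse_message_id(message_id)
        if self.last is not None:
            last_iid, last_seq = self.last
            if iid < last_iid:
                return False  # obsolete: sent by an old manager instance
            if iid == last_iid and seq <= last_seq:
                return False  # duplicate or out-of-order within same instance
        self.last = (iid, seq)
        return True
```

Because IIDs are handed out by the registry and are monotonically increasing, a plain integer comparison is enough to order instances globally.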

11 Issues in the Distributed System: Security
Provide secure communication between communicating parties (e.g. Manager and Managee):
– Publish/Subscribe: provenance, lifetime, unique topics
– Secure discovery of endpoints
– Prevent unauthorized users from accessing the Managers or Managees
– Prevent malicious users from modifying messages (thus message interactions are secure even when passing through insecure intermediaries)
Utilize NaradaBrokering's Topic Creation and Discovery* and Security scheme#
* NB Topic Creation and Discovery (Grid2005): http://grids.ucs.indiana.edu/ptliupages/publications/NB-TopicDiscovery-IJHPCN.pdf
# NB Security (Grid2006): http://grids.ucs.indiana.edu/ptliupages/publications/NB-SecurityGrid06.pdf

12 Implemented: WS Specifications
– WS-Management (June 2005) in part: WS-Transfer [Sep 2004], WS-Enumeration [Sep 2004] and WS-Eventing (could use WSDM instead)
– WS-Eventing (leveraged from the WS-Eventing capability implemented in OMII)
– WS-Addressing [Aug 2004] and SOAP v1.2 used (needed for WS-Management)
– Used XmlBeans 2.0.0 for manipulating XML in a custom container
– Currently implemented using JDK 1.4.2 but will switch to JDK 1.5
– Released on http://www.naradabrokering.org in February 2007

13 Performance Evaluation Results
– Extreme case with many catastrophic failures
– Response time increases with increasing number of concurrent requests
– Response time is MANAGEE-DEPENDENT, and the times shown are typical
– MAY involve 1 or more registry accesses, which will increase overall response time
– Increases rapidly as the number of managees exceeds roughly 150 – 200

14 Performance Evaluation
How much infrastructure is required to manage N managees?
N = number of managees to manage
M = max. number of entities connected to a single messaging node
D = max. number of managees managed by a single manager process
R = min. number of registry service instances required to provide fault-tolerance
Assume every leaf domain has 1 messaging node; hence we have N/M leaf domains. Further, the number of managers required per leaf domain is M/D.
Total components in the lowest level
  = (R registries + 1 Bootstrap Service + 1 Messaging Node + M/D managers) * (N/M leaf domains)
  = (2 + R + M/D) * (N/M)
Thus the percentage of additional infrastructure is
  = [(2 + R)/M + 1/D] * 100 %

15 Performance Evaluation
Research question: How much infrastructure is required to manage N managees?
Additional infrastructure = [(2 + R)/M + 1/D] * 100 %
A few cases:
– Typical values of D and M are 200 and 800; assuming R = 4:
  Additional infrastructure = [(2 + 4)/800 + 1/200] * 100 % ≈ 1.2 %
– Shared registry (one registry interface per domain), R = 1:
  Additional infrastructure = [(2 + 1)/800 + 1/200] * 100 % ≈ 0.87 %
– If NO messaging node is used (assume D = 200):
  Additional infrastructure = [(R registries + 1 bootstrap node + N/D managers)/N] * 100 %
  = [(1 + R)/N + 1/D] * 100 % ≈ 100/D % (for N >> R) ≈ 0.5 %
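The three cases above can be checked numerically. The helper function names are mine; the constants (M = 800, D = 200, R = 4 or 1) are the slide's.

```python
# Numeric check of the additional-infrastructure formulas from the slides.

def overhead_percent(R, M, D):
    """With messaging nodes: ((2 + R)/M + 1/D) * 100 %."""
    return ((2 + R) / M + 1 / D) * 100

def overhead_no_node_percent(R, N, D):
    """Without messaging nodes: ((1 + R)/N + 1/D) * 100 %."""
    return ((1 + R) / N + 1 / D) * 100

print(overhead_percent(4, 800, 200))            # ≈ 1.25, slide rounds to ~1.2 %
print(overhead_percent(1, 800, 200))            # ≈ 0.875, i.e. ~0.87 %
print(overhead_no_node_percent(4, 10**6, 200))  # ≈ 0.5 for N >> R
```

The first case actually evaluates to 1.25 %, which the slide reports as ≈ 1.2 %; either way the overhead stays near 1 %.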

16 Performance Evaluation (graph)
Research question: How much infrastructure is required to manage N managees?

17 Performance Evaluation: XML Processing Overhead
– XML processing overhead is measured as the total marshalling and un-marshalling time required
– For broker-management interactions, typical processing time (including validation against the schema) ≈ 5 ms
– Broker-management operations are invoked only during initialization and recovery from failure
– Reading broker state using a GET operation involves the 5 ms overhead and is invoked periodically (e.g. every 1 minute, depending on policy)
– Further, for most operations that change broker state, the actual operation processing time >> 5 ms, hence the XML overhead of 5 ms is acceptable

18 Prototype: Managing Grid Messaging Middleware
We illustrate the architecture by managing our distributed messaging middleware, NaradaBrokering:
– This example is motivated by the presence of a large number of dynamic peers (brokers) that need configuration and deployment in specific topologies
– Runtime metrics provide dynamic hints on improving routing, which leads to redeployment of the messaging system, possibly using a different configuration and topology
– Can use (dynamically) optimized protocols (UDP vs TCP vs Parallel TCP) and go through firewalls
Broker Service Adapter:
– Note NB illustrates an electronic entity that didn't start off with an administrative Service interface
– So we add a wrapper over the basic NB BrokerNode object that provides a WS-Management front-end
– Allows CREATION, CONFIGURATION and MODIFICATION of brokers and broker topologies

19 Messaging (NaradaBrokering) Architecture (architecture diagram)

20 Typical use of Grid Messaging in NASA (slide from CLADE 2006, Community Grids Lab, Bloomington IN, June 19, 2006)
Diagram: a Datamining Grid and a GIS Grid linked to a Sensor Grid implemented using NB (NaradaBrokering).

21 NaradaBrokering Management Needs
The NaradaBrokering distributed messaging system consists of peers (brokers) that collectively form a scalable messaging substrate. Optimizations and configurations include:
– Where should brokers be placed and how should they be connected? E.g. RING, BUS, TREE, HYPERCUBE etc.; each topology has a varying degree of resource utilization, routing cost and fault-tolerance characteristics
Static topologies, or topologies created using static rules, may be inefficient in some cases:
– E.g., in CAN and Chord a new incoming peer joins random nodes in the network; network distances are not taken into account, so some lookup queries may span the entire diameter of the network
– Runtime metrics provide dynamic hints on improving routing, which leads to redeployment of the messaging system, possibly using a different configuration and topology
– Can use (dynamically) optimized protocols (UDP vs TCP vs Parallel TCP) and go through firewalls, but there is no good way to make these choices dynamically
These actions are collectively termed Managing the Messaging Middleware.

22 Prototype: Costs (individual managees are NaradaBrokering brokers)

Operation           Time (msec, average values)
                    Un-Initialized (first time)   Initialized (later modifications)
Set Configuration   778 ± 5                       33 ± 3
Create Broker       610 ± 6                       57 ± 2
Create Link         160 ± 2                       27 ± 2
Delete Link         104 ± 2                       20 ± 1
Delete Broker       142 ± 11                      29 ± 2

23 Recovery: Typical Time
Recovery Time = T(Read State From Registry) + T(Bring managee up to speed)
              = T(Read State) + T[SetConfig + Create Broker + CreateLink(s)]
Assuming 5 ms read time from the registry per managee object:

Topology   Configuration                                 Recovery time (msec)
Ring       N nodes, N links (1 outgoing link per         10 + (778.0 + 610.1 + 160.5) ≈ 1559
           node); 2 managee objects per node
Cluster    N nodes, links per broker vary from 0 – 3;    Min: 5 + 778.0 + 610.1 ≈ 1393
           1 – 4 managee objects per node                Max: 20 + 778.0 + 610.1 + 160.5 + 2 * 26.67 ≈ 1622
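These estimates can be recomputed from the per-operation averages (778.0 ms SetConfig, 610.1 ms Create Broker, 160.5 ms first Create Link, ≈ 26.67 ms per later link, 5 ms registry read per managee object). The function is a hypothetical sketch; note that the ring total comes to ≈ 1559 ms once the two 5 ms reads are included.

```python
# Recomputing the typical recovery times from the per-operation costs.
# Constants are the slide's measured averages; the function itself is mine.

READ_MS = 5.0             # registry read per managee object
SET_CONFIG_MS = 778.0     # first-time Set Configuration
CREATE_BROKER_MS = 610.1  # first-time Create Broker
FIRST_LINK_MS = 160.5     # first-time Create Link
LATER_LINK_MS = 26.67     # subsequent Create Link (initialized case)

def recovery_ms(objects, first_links=0, later_links=0):
    return (objects * READ_MS + SET_CONFIG_MS + CREATE_BROKER_MS
            + first_links * FIRST_LINK_MS + later_links * LATER_LINK_MS)

print(round(recovery_ms(2, first_links=1)))                 # ring: 1559
print(round(recovery_ms(1)))                                # cluster min: 1393
print(round(recovery_ms(4, first_links=1, later_links=2)))  # cluster max: 1622
```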

24 Prototype: Observed Recovery Cost per Managee

Operation                       Average (msec)
Spawn Process*                  2362 ± 18
Read State                      8 ± 1
Restore (1 Broker + 1 Link)     1421 ± 9
Restore (1 Broker + 3 Links)    1616 ± 82

* Time for Create Broker depends on the number and type of transports opened by the broker; e.g. an SSL transport requires negotiation of keys and needs more time than simply establishing a TCP connection.
If brokers connect to other brokers, the destination broker MUST be ready to accept connections, else topology recovery takes more time.

25 Management Console: Creating Nodes and Setting Properties

26 Management Console: Creating Links

27 Management Console: Policies

28 Management Console: Creating Topologies

29 Conclusion
We have presented a scalable, fault-tolerant management framework that:
– Adds acceptable cost in terms of extra resources required (about 1%)
– Provides a general framework for management of distributed entities
– Is compatible with existing Web Service specifications
We have applied our framework to manage managees that are loosely coupled and have modest external state (important to improve scalability of the management process).
An outside effort is developing a Grid Builder which combines BPEL and this management system to manage the initial specification, composition and execution of Grids of Grids (of Services).

