The University of Durham e-Demand Project. Paul Townend, 14th April 2003.


2 About the Project The e-Demand project at Durham is basically concerned with:
- Construction of a Service-based Architecture
- Testing (Fault injection)
- Security (FT-PIR)
- Visualisation (Auto-3D)
- Fault Tolerance

3 Service-based Architecture The architecture that we are developing. (Diagram: a service consumer, a contractor/assembly service provider, a catalogue/ontology provider and service/solution providers, connected by demand/provision, finding, publishing and ultra-late binding; example services include the e-Action service and an attack-tolerance service, among others.)

4 Testing Service Our testing service currently implements network-level fault injection. (Diagram: the fault injector sits at the middleware boundary between client and server; it intercepts the service request and the response and forwards potentially altered versions, so either may contain injected faults.)

5 Security service Our Fault Tolerant Private Information Retrieval service (FT-PIR) will allow users to query database records without revealing their true intentions. (Diagram: user, client, server and database.)

6 Visualisation service The e-Demand project is also developing visualisation services. We hope to have a demo available for the 2nd national All-Hands conference in Nottingham. I don't really know enough to say any more about this area!

7 Fault Tolerance on the Grid Fault Tolerance (FT) is the main focus of this talk. FT allows a service to tolerate a fault and continue to provide its service in some fashion. There is a great need for FT in the Grid community, but currently only the GGF Checkpoint Recovery working group (Grid CPR-WG) is active in this area. We are seeking to perform work that will make it easier to provide FT on the Grid. In the following slides, we will look at the need for fault tolerance in the Grid, and at some potential problems we may be able to resolve.

8 The Need for Fault Tolerance (1) As applications scale to take advantage of Grid resources, their size and complexity will increase dramatically. Experience has shown that systems with complex, asynchronous, interacting activities are very prone to errors and failures. At the same time, many Grid applications will perform long tasks that may require several days of computation, if not more.

9 The Need for Fault Tolerance (2) "In a wide-area, distributed grid environment, however, the need for fault tolerance is essential. Besides having to cope with the higher probability of faults in a large system, the cost and difficulty of containing and recovering from a fault is higher. It is unacceptable that a process, host or network failure should cause a distributed grid application to irrevocably hang or malfunction in any way such that manual intervention is required at multiple sites." – Grid RPC, Events and Messaging, C. Lee, The Aerospace Corporation, September 2001

10 Example of the need for Fault Tolerance Consider an application, decomposed into 100 services. Assume each service has an MTTF of 120 days and requires a week of computation. Assuming an exponentially distributed failure mode, the composed application would have an MTTF of 1.2 days. Without any kind of FT, the application would rarely finish.
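As a rough back-of-the-envelope check (assuming the 100 services fail independently and that 120 days is each service's mean time to failure):

```latex
% Back-of-the-envelope check, assuming the 100 services fail independently,
% each with an exponentially distributed time to failure and MTTF = 120 days.
\begin{align*}
  \lambda_{\text{app}} &= \sum_{i=1}^{100} \lambda_i
                        = 100 \times \tfrac{1}{120\ \text{days}} \\
  \text{MTTF}_{\text{app}} &= \frac{1}{\lambda_{\text{app}}}
                        = \frac{120}{100}\ \text{days} = 1.2\ \text{days} \\
  P(\text{runs 7 days without failure}) &= e^{-7/1.2} \approx 0.3\%
\end{align*}
```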

11 Expressing Fault Tolerance Capabilities (1) Application metadata is critical for the development of Grid computing environments. Information captured by metadata supports discovery of applications in the Grid environment. It also facilitates the seamless composition of services. We are therefore seeking to create a standard way of expressing fault tolerance properties in service metadata.

12 Expressing Fault Tolerance Capabilities (2) This would allow a user to identify whether, for example, a service uses Recovery Blocks, Multi-Version Design or has no fault tolerance whatsoever. This information could then be used in both WSDL and Service Data Elements.
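As a purely illustrative sketch of the kind of information that could be captured – the field names below (ft_scheme, versions, adjudicator, acceptance_test) are invented for this example and are not an existing WSDL or Service Data Element vocabulary:

```python
# Illustrative only: a hypothetical fault-tolerance capability record that a
# service could expose in its metadata (e.g. serialised into a Service Data
# Element or a WSDL extension). Field names are invented for this sketch.
ft_capability = {
    "ft_scheme": "multi_version_design",   # or "recovery_blocks", "none", ...
    "versions": 3,                         # number of independently developed variants
    "adjudicator": "majority_vote",        # how divergent results are resolved
    "acceptance_test": None,               # description/URI of a test, if any
}

def tolerates_design_faults(capability: dict) -> bool:
    """Crude example of a consumer-side check against advertised metadata."""
    return capability.get("ft_scheme") in {"multi_version_design", "recovery_blocks"}

print(tolerates_design_faults(ft_capability))  # True for this example record
```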

13 PC Grids (1) Perhaps one of the most attractive opportunities allowed by Grid technologies is the idea of the PC Grid. (Diagram: a client and a pool of PCs connected to a PC Grid server.)

14 PC Grids (2) The issue of providing fault tolerance is not as simple as it might initially appear. As the individual nodes on a PC Grid are all potentially insecure, it is evident that replication is the most suitable FT methodology to use. However, different nodes in the Grid will be running at different speeds, and have different loads at any one time. It thus becomes difficult to guarantee a job will be finished within a given amount of time.

15 PC Grids (3) Simple replication might therefore not be suitable, as the server may be left waiting for a heavily loaded node to finish and submit its results, while it already has the other nodes' results. (Diagram: a replication block of five nodes; three finish in 2 minutes, two take 8 minutes, and the PC Grid server must wait for the slowest.)
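Using the timings from the diagram, a small illustration of the overhead (assuming three matching results would already be enough to vote on):

```python
# Sketch: with simple replication the block's turnaround is set by the slowest
# replica, so one heavily loaded node dominates. Times (minutes) are taken from
# the diagram on this slide.
replica_times = [2, 2, 2, 8, 8]

turnaround = max(replica_times)                       # 8: server waits for the slowest node
time_to_three_results = sorted(replica_times)[2]      # 2: a 3-way vote was already possible here
wasted_wait = turnaround - time_to_three_results      # 6 minutes of avoidable waiting
print(turnaround, wasted_wait)                        # 8 6
```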

16 PC Grids (4) In addition, PC Grids are highly dynamic – nodes may join or leave at any time. We therefore can't make any guarantees about the performance of each node. We can, however, make general assumptions about their ability to perform a job within a given time-frame, based on their hardware and historical load levels, etc. So here is an initial FT scheme we are currently looking at for providing replication on PC Grids…

17 Replication on a PC Grid (1) It may be the case that some jobs to be sent out on the PC Grid are more important than others. Ideally, we want the most important jobs (or the ones requiring most compute time) to be processed quickly. We also want to ensure that different replication blocks finish at approximately the same time, so that the PC Grid server isn't waiting to vote on jobs for too long.

18 Replication on a PC Grid (2) We also need to allow for the possibility of nodes within a replication block leaving the PC Grid (voluntarily or due to some kind of failure). Given that the resources available should be plentiful, we can therefore use more replication than we strictly need. We can then vote on the results of the first n nodes within a replication block that return results (with n being arbitrary). So we are thinking of something like the following…

19 Our Very Provisional Solution (1) Whenever a node joins the PC Grid, it must be assigned a performance category based on its hardware capabilities and load. These categories are dynamic, and are continually re-assessed by the PC Grid server.
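A minimal sketch of what such a categorisation might look like – the three-level scale, the thresholds and the node attributes are all assumptions made for illustration, not values defined by the project:

```python
# Sketch only: assign a node to a coarse performance category from its
# advertised hardware and its recent load history. The thresholds and the
# three-level scale are illustrative assumptions.
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class Node:
    name: str
    cpu_mhz: int                                       # advertised hardware capability
    load_history: list = field(default_factory=list)   # recent load averages, 0.0 - 1.0
    category: int = 3                                   # 1 = fastest, 3 = slowest

def assess_category(node: Node) -> int:
    """Re-assessed periodically by the PC Grid server as load history changes."""
    avg_load = mean(node.load_history) if node.load_history else 1.0
    effective_power = node.cpu_mhz * (1.0 - avg_load)
    if effective_power > 1500:
        return 1
    if effective_power > 750:
        return 2
    return 3

n = Node("lab-pc-07", cpu_mhz=2400, load_history=[0.2, 0.3, 0.25])
n.category = assess_category(n)   # -> 1 for this lightly loaded 2.4 GHz machine
```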

20 Our Very Provisional Solution (2) Similarly, when a job is submitted to a PC Grid, the scheduler must decide on the priority of the job. This may be based on whether the job requires lots of computation, or perhaps is critical, etc. It then identifies several nodes within the PC Grid that meet the performance requirements of the job, based on their performance category.

21 Our Very Provisional Solution (3) The server then sends the replicated job to several of these nodes, to form a replication block. Should resources allow, this block can contain more nodes than we need, in order to guard against some of them leaving/failing. We might specify that – should there be a lack of nodes within the given performance category – we use a mixture of nodes from other categories.
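Putting slides 19–21 together, a very provisional sketch of block formation (reusing the Node record from the earlier sketch; the mapping of job priority onto a preferred category, the block size and the fallback to neighbouring categories are again assumptions):

```python
# Sketch: form a replication block for a job, preferring idle nodes whose
# category matches the job's priority and padding the block with extra nodes,
# or nodes from other categories, when resources allow. Policy constants are
# illustrative assumptions; Node is the record from the earlier sketch.
import random

def form_replication_block(job_priority, idle_nodes, needed=3, extra=1):
    """Return the nodes that should run replicas of the job.

    job_priority : 1 (most important) .. 3, treated as the preferred category
    idle_nodes   : nodes not currently assigned to any replication block
    needed       : replicas required for a meaningful vote
    extra        : additional replicas to guard against nodes leaving/failing
    """
    target = needed + extra
    # Prefer nodes whose performance category matches the job's priority...
    preferred = [n for n in idle_nodes if n.category == job_priority]
    block = random.sample(preferred, min(target, len(preferred)))
    # ...and fall back to a mixture of the nearest other categories if too few.
    if len(block) < target:
        others = [n for n in idle_nodes if n not in block]
        others.sort(key=lambda n: abs(n.category - job_priority))
        block += others[: target - len(block)]
    return block
```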

22 Example (1) Assume we have a PC Grid like the one shown. Then assume a job is submitted that is adjudged to have a priority of 1. (Diagram: a PC Grid server with nodes grouped into performance categories.)

23 Example (2) We only need 3 nodes for decent replication, but we have 4 that are not used, so why not?! (Diagram: the job of priority 1 arrives from the client at the PC Grid server, which forms the job's replication block from the 4 spare nodes.)

24 Example (3) Part way through computation, one of the nodes either leaves the PC Grid or is reallocated to another replication block – but we still have 3 left in the job's replication block.

25 Example (4) As each node finishes its task, it sends its results back to the server, and is free to be allocated elsewhere. The server stores the results centrally and waits for the final node in the replication block to finish.

26 Example (5) Because the nodes used were of similar performance, the server will not have to wait long, and hence overhead should be kept down. Eventually, the final node finishes; the results are voted on and – if successful – sent back to the client.
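A sketch of the adjudication step – majority voting over the first n results returned, as suggested on slide 18; the choice of n = 3 and the strict-majority rule are assumptions for illustration:

```python
# Sketch: majority vote over the first n results returned by a replication
# block. The server collects results as nodes finish and adjudicates once n
# have arrived; replicas that have not yet answered are ignored.
from collections import Counter

def vote(results, n=3):
    """results: list of (node_name, value) in arrival order. Returns the agreed
    value, or None if fewer than n results exist or no strict majority agrees."""
    first_n = [value for _, value in results[:n]]
    if len(first_n) < n:
        return None                                   # not enough results yet
    value, count = Counter(first_n).most_common(1)[0]
    return value if count > n // 2 else None          # strict majority required

arrived = [("lab-pc-07", 42), ("lab-pc-11", 42), ("lab-pc-03", 41)]
print(vote(arrived))   # -> 42: two of the first three replicas agree
```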

27 Acceptance Tests on the Grid (1) Speaking of voting, another area of FT that is traditionally problematic is that of acceptance testing. This is where the result of a program/service is verified by one or more tests, performed automatically. A number of FT schemes depend on such testing, but the testing itself must usually be simple, as it otherwise introduces unacceptable run-time overhead.

28 Acceptance Tests on the Grid (2) With the Grid, this problem may be solvable for some applications. Rather than process the acceptance test locally, we could send the data, together with either an executable or a schema specifying the test to perform, to an HPC node. The overhead of compute time would thus be decreased, although whether this will be offset by the increase in communication overhead remains to be seen.
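A sketch of what farming the test out might look like – submit_to_hpc_node is a hypothetical stand-in for whatever remote invocation mechanism is used (e.g. a Grid service call), not a real API, and the test itself is a placeholder:

```python
# Sketch only: offload an acceptance test to a remote HPC node instead of
# running it locally. All names here are illustrative placeholders.

def expensive_acceptance_test(result_data) -> bool:
    # Placeholder check: in reality this might verify a numerical property of
    # the result, re-run a cheap cross-check, and so on.
    return result_data is not None

def submit_to_hpc_node(test_callable, data) -> bool:
    # Hypothetical remote call: ship the data plus an executable (or a schema
    # describing the test) to an HPC node and wait for its verdict. Simulated
    # here by simply running the test locally.
    return test_callable(data)

def verify(result_data, offload: bool = True) -> bool:
    if offload:
        # Local compute time is saved, but the data must cross the network;
        # whether the trade-off wins depends on data size and link speed.
        return submit_to_hpc_node(expensive_acceptance_test, result_data)
    return expensive_acceptance_test(result_data)
```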

29 Conclusions (1) The e-Demand project is multi-faceted – it's looking at security, visualisation, testing and fault tolerance. The main focus of this talk has been to present some ideas we have in regard to fault tolerance. FT is obviously needed on the Grid. A standard way of expressing FT capabilities in a service's metadata would be A Good Thing. We are inclined to focus initially on the problem of providing FT to PC Grids.

30 Conclusions (2) At first glance, FT on a PC Grid simply involves replication, but it soon becomes apparent that a better solution involves:
- Assessing and grouping PC Grid nodes.
- Assessing and scheduling jobs.
- Using extra redundancy to tolerate nodes leaving.
- Perhaps reallocating redundant nodes on the fly.
- Perhaps farming out computationally expensive acceptance tests to HPC nodes.
- Assessing scalability of this architecture.
- And so on!

31 Open Issues Obviously, this work is still in its initial stages and so there are many things that need to be considered, such as:
- Cost (will using extra PC Grid nodes be chargeable?)
- How to choose nodes for replication blocks
- How to dynamically assess node performance
- How to dynamically assess job priorities
- What to do if consensus is not reached, or only 1 node successfully returns the job
- Whether a traditional distributed system fault model is applicable in a grid environment, or whether revision is needed.

32 Open Issues (2) The traditional distributed systems fault model includes events such as:
- Physical faults
- Software faults
- Timing faults
- Communication faults
- Life-cycle faults
Is such a traditional fault model acceptable for Grid computation, or is some revision required?

33 Thanks! If you have any questions or anything, then get in touch – and you might even get a reply!