The University of Durham e-Demand Project
Paul Townend, 14th April 2003
2 About the Project
The e-Demand project at Durham is concerned with:
- Construction of a service-based architecture
- Testing (fault injection)
- Security (FT-PIR)
- Visualisation (Auto-3D)
- Fault tolerance
3 Service-based Architecture
The architecture that we are developing involves:
- Service consumer
- Contractor/assembly service provider
- Catalogue/ontology provider
- Service/solution provider
Interactions between these parties include demand, provision, finding, publishing, and ultra-late binding. Example services include an e-Action service, an attack-tolerance service, and others.
4 Testing Service
Our testing service currently implements network-level fault injection.
[Diagram: a fault injector (the testing service) sits at the middleware boundary between client and server. It intercepts service requests and responses, potentially altering them before forwarding, so that both requests and responses may contain injected faults.]
5 Security Service
Our Fault-Tolerant Private Information Retrieval (FT-PIR) service will allow users to query database records without revealing their true intentions.
[Diagram: user, client, server, and database.]
6 Visualisation Service
The e-Demand project is also developing visualisation services. We hope to have a demo available for the 2nd national All Hands conference in Nottingham. I don't really know enough to say any more about this area!
7 Fault Tolerance on the Grid
Fault Tolerance (FT) is the main focus of this talk. FT allows a service to tolerate a fault and continue to provide its service in some fashion. There is a great need for FT in the Grid community, but currently only the GGF Checkpoint Recovery group (Grid CPR-WG) is at work in this area. We are seeking to make FT easier to provide on the Grid. In the following slides, we will look at the need for fault tolerance in the Grid, and at some potential problems we may be able to resolve.
8 The Need for Fault Tolerance (1)
As applications scale to take advantage of Grid resources, their size and complexity will increase dramatically. Experience has shown that systems with complex asynchronous and interacting activities are very prone to errors and failures. At the same time, many Grid applications will perform long tasks that may require several days of computation, if not more.
9 The Need for Fault Tolerance (2)
"In a wide-area, distributed grid environment, however, the need for fault tolerance is essential. Besides having to cope with the higher probability of faults in a large system, the cost and difficulty of containing and recovering from a fault is higher. It is unacceptable that a process, host or network failure should cause a distributed grid application to irrevocably hang or malfunction in any way such that manual intervention is required at multiple sites." (Grid RPC, Events and Messaging, C. Lee, The Aerospace Corporation, September 2001)
10 Example of the Need for Fault Tolerance
Consider an application decomposed into 100 services. Assume each service has an MTTF of 120 days, and that the application requires a week of computation. Assuming an exponentially distributed failure mode, the composed application would have an MTTF of only 1.2 days. Without any kind of FT, the application would rarely finish.
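The arithmetic behind this can be checked with a short sketch (assuming independent, exponentially distributed failures, as above):

```python
import math

# Failure rate of one service under the exponential model: lambda = 1 / MTTF.
service_mttf_days = 120.0
n_services = 100

# Failure rates of independent exponential components add, so the
# composed application's MTTF is service_mttf / n_services.
composed_mttf = service_mttf_days / n_services
print(composed_mttf)  # 1.2 days

# Probability that all 100 services survive a 7-day computation:
job_days = 7.0
p_finish = math.exp(-job_days / composed_mttf)
print(round(p_finish, 4))  # roughly 0.003: the job almost never finishes
```

So without FT, fewer than one run in 300 would complete.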
11 Expressing Fault Tolerance Capabilities (1) Application metadata is critical for the development of Grid computing environments. Information captured by metadata supports discovery of applications in the Grid environment. It also facilitates the seamless composition of services. We are therefore seeking to create a standard way of expressing fault tolerance properties in service metadata.
12 Expressing Fault Tolerance Capabilities (2)
This would allow a user to identify whether, for example, a service uses Recovery Blocks, Multi-Version Design or has no fault tolerance whatsoever. This information could then be used in both WSDL and Service Data Elements.
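As a rough sketch of what such metadata might look like, a Service Data Element could advertise FT properties along these lines (all element and attribute names here are invented for illustration, not drawn from any existing standard):

```xml
<!-- Hypothetical Service Data Element advertising fault-tolerance properties -->
<serviceData name="faultTolerance">
  <ftScheme>RecoveryBlocks</ftScheme>  <!-- or MultiVersionDesign, None -->
  <replicationDegree>3</replicationDegree>
  <acceptanceTest>true</acceptanceTest>
</serviceData>
```

A consumer or assembly service could then match on these elements at discovery time, in the same way it matches on functional metadata.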
13 PC Grids (1)
Perhaps one of the most attractive opportunities allowed by Grid technologies is the idea of the PC Grid.
[Diagram: a client submits work to a PC Grid server, which farms it out to many individual PCs.]
14 PC Grids (2)
The issue of providing fault tolerance is not as simple as it might initially appear. As the individual nodes on a PC Grid are all potentially insecure, it is evident that replication is the most suitable FT methodology to use. However, different nodes in the Grid will be running at different speeds, and have different loads at any one time. It thus becomes difficult to guarantee a job will be finished within a given amount of time.
15 PC Grids (3)
Simple replication might therefore not be suitable, as the server may be left waiting for a heavily loaded node to finish and submit its results, while it already has the other nodes' results.
[Diagram: a replication block in which three nodes finish in 2 minutes while two take 8 minutes, leaving the PC Grid server waiting.]
16 PC Grids (4)
In addition, PC Grids are highly dynamic: nodes may join or leave at any time. We therefore can't make any guarantees about the performance of each node. We can, however, make general assumptions about their ability to perform a job within a given time-frame, based on their hardware, historical load levels, etc. So here is an initial FT scheme we are currently looking at for providing replication on PC Grids…
17 Replication on a PC Grid (1)
It may be the case that some jobs to be sent out on the PC Grid are more important than others. Ideally, we want the most important jobs (or the ones requiring most compute time) to be processed quickly. We also want to ensure that different replication blocks finish at approximately the same time, so that the PC Grid server isn't waiting to vote on jobs for too long.
18 Replication on a PC Grid (2)
We also need to allow for the possibility of nodes within a replication block leaving the PC Grid (voluntarily, or due to some kind of failure). Given that the resources available should be plentiful, we can therefore use more replication than we strictly need. We can then vote on the results of the first n nodes within a replication block that return results (with n being configurable). So we are thinking of something like the following…
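A minimal sketch of voting on the first n results to arrive (the function name and the result representation are illustrative assumptions, not part of the project's design):

```python
from collections import Counter

def vote_on_first_n(results, n):
    """Majority-vote on the first n results returned by a replication block.

    results: results in arrival order (the list may be longer than n,
    since we deliberately over-replicate).
    Returns the consensus value, or None if no majority is reached.
    """
    first_n = results[:n]
    value, count = Counter(first_n).most_common(1)[0]
    return value if count > n // 2 else None

# Five replicas were dispatched; we vote as soon as three have returned.
print(vote_on_first_n([42, 42, 41], n=3))  # 42
print(vote_on_first_n([42, 41, 40], n=3))  # None (no majority)
```

The "no majority" case is exactly the open issue raised later: what to do when consensus is not reached.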
19 Our Very Provisional Solution (1)
Whenever a node joins the PC Grid, it must be assigned a performance category based on its hardware capabilities and load. These categories are dynamic, and continually re-assessed by the PC Grid server.
[Diagram: a PC Grid server managing four category-1 nodes and three category-2 nodes.]
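A toy sketch of such a category assignment (the benchmark metric, the load weighting, and the thresholds are all invented for illustration; a real PC Grid server would re-run this assessment continually):

```python
def performance_category(benchmark_score, avg_load):
    """Assign a node to a performance category (1 = fastest).

    benchmark_score: a hypothetical hardware benchmark result.
    avg_load: recent average load in [0, 1]; heavy load discounts
    the node's effective performance.
    """
    effective = benchmark_score * (1.0 - min(avg_load, 0.9))
    if effective >= 75.0:
        return 1
    elif effective >= 40.0:
        return 2
    return 3

print(performance_category(100.0, 0.1))  # a fast, lightly loaded node -> 1
print(performance_category(100.0, 0.6))  # the same node under load -> 2
```

Because load changes over time, the same hardware can move between categories, which is why the categories must stay dynamic.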
20 Our Very Provisional Solution (2) Similarly, when a job is submitted to a PC Grid, the scheduler must decide on the priority of the job. This may be based on whether the job requires lots of computation, or perhaps is critical, etc. It then identifies several nodes within the PC Grid that meet the performance requirements of the job, based on their performance category.
21 Our Very Provisional Solution (3) The server then sends the replicated job to several of these nodes, to form a replication block. Should resources allow, this block can contain more nodes than we need, in order to guard against some of them leaving/failing. We might specify that – should there be a lack of nodes within the given performance category – we use a mixture of nodes from other categories.
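The selection step described in the last two slides might be sketched like this (the names and the simple fallback policy are assumptions for illustration, not the project's actual scheduler):

```python
def build_replication_block(nodes, category, needed, extra=1):
    """Pick nodes for a replication block, preferring the job's
    performance category but falling back to other categories if
    there aren't enough preferred nodes.

    nodes: dict of node_id -> performance category (1 = fastest).
    Returns a list of node ids of size up to needed + extra;
    raises if even the fallback cannot reach the minimum.
    """
    preferred = [nid for nid, cat in nodes.items() if cat == category]
    others = [nid for nid, cat in nodes.items() if cat != category]
    block = (preferred + others)[:needed + extra]
    if len(block) < needed:
        raise RuntimeError("not enough nodes for minimum replication")
    return block

nodes = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 2, "F": 2, "G": 2}
# Priority-1 job: 3 replicas needed, plus one spare in case a node leaves.
print(build_replication_block(nodes, category=1, needed=3, extra=1))
```

The `extra` spare is what lets the block survive a node leaving mid-computation, as in the example that follows.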
22 Example (1)
Assume we have a PC Grid with four category-1 nodes and three category-2 nodes. Then assume a job is submitted that is adjudged to have a priority of 1.
[Diagram: PC Grid server with four category-1 nodes and three category-2 nodes.]
23 Example (2)
We only need 3 nodes for decent replication, but we have 4 that are not used, so why not?!
[Diagram: the priority-1 job from the client is replicated across all four category-1 nodes, which form the job's replication block.]
24 Example (3)
Part way through computation, one of the nodes either leaves the PC Grid or is reallocated to another replication block, but we still have 3 left.
[Diagram: the job's replication block now contains three category-1 nodes.]
25 Example (4)
As each node finishes its task, it sends its results back to the server and is free to be allocated elsewhere. The server stores the results centrally and waits for the final job in the replication block to finish.
[Diagram: some nodes in the job's replication block have finished; the rest are still computing.]
26 Example (5)
Because the nodes used were of similar performance, the server will not have to wait long, and hence overhead should be kept down. Eventually the final node finishes, the result is voted on, and, if successful, sent back to the client.
27 Acceptance Tests on the Grid (1) Speaking of voting, another area of FT that is traditionally problematic is that of acceptance testing. This is where the result of a program/service is verified by one or more tests, performed automatically. A number of FT schemes depend on such testing, but the testing itself must usually be simple, as it otherwise introduces unacceptable run-time overhead.
28 Acceptance Tests on the Grid (2) With the Grid, this problem may be solvable for some applications. Rather than process the acceptance test locally, we could send the data and either an executable or a schema specifying the test to perform, to an HPC node. The overhead of compute time would thus be decreased, although whether this will be offset by the increase in communication overhead remains to be seen.
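One way to frame that trade-off is a crude cost model: offload the acceptance test only when the local compute time exceeds the data-transfer time plus the (faster) remote compute time. All figures below are illustrative assumptions, not measurements:

```python
def should_offload(test_cpu_seconds, data_mb, link_mbps=100.0,
                   hpc_speedup=20.0):
    """Decide whether to farm an acceptance test out to an HPC node.

    test_cpu_seconds: estimated local run time of the test.
    data_mb: data that must be shipped along with the test.
    link_mbps / hpc_speedup: assumed network and HPC characteristics.
    """
    transfer_s = data_mb * 8.0 / link_mbps       # time to ship the data
    remote_s = test_cpu_seconds / hpc_speedup    # time to run it remotely
    return test_cpu_seconds > transfer_s + remote_s

# A 60-second test over 10 MB of data: offloading wins.
print(should_offload(60.0, 10.0))   # True
# A 1-second test over 100 MB of data: communication dominates.
print(should_offload(1.0, 100.0))   # False
```

This matches the slide's caveat: whether the compute saving is offset by communication overhead depends entirely on the test and the data volume.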
29 Conclusions (1)
The e-Demand project is multi-faceted: it's looking at security, visualisation, testing and fault tolerance. The main focus of this talk has been to present some ideas we have in regard to fault tolerance. FT is obviously needed on the Grid. A standard way of expressing FT capabilities in a service's metadata would be A Good Thing. We are inclined to focus initially on the problem of providing FT to PC Grids.
30 Conclusions (2)
At first glance, FT on a PC Grid simply involves replication, but it soon becomes apparent that a more optimal solution involves:
- Assessing and grouping PC Grid nodes.
- Assessing and scheduling jobs.
- Using extra redundancy to tolerate nodes leaving.
- Perhaps reallocating redundant nodes on the fly.
- Perhaps farming out computationally expensive acceptance tests to HPC nodes.
- Assessing the scalability of this architecture.
- And so on!
31 Open Issues
Obviously, this work is still in its initial stages, and so there are many things that need to be considered, such as:
- Cost (will using extra PC Grid nodes be chargeable?)
- How to choose nodes for replication blocks.
- How to dynamically assess node performance.
- How to dynamically assess job priorities.
- What to do if consensus is not reached, or only one node successfully returns the job.
- Whether a traditional distributed-system fault model is applicable in a grid environment, or whether revision is needed.
32 Open Issues (2)
The traditional distributed-systems fault model includes events such as:
- Physical faults
- Software faults
- Timing faults
- Communication faults
- Life-cycle faults
Is such a traditional fault model acceptable for Grid computation, or is some revision required?
33 Thanks!
If you have any questions or anything, then e-mail me, and you might even get a reply!