1 An unattended, fault-tolerant approach for the execution of distributed applications Manuel Rodríguez-Pascual, Rafael Mayo-García CIEMAT Madrid, Spain.

2 Outline: Problem, Solution, Architecture, Implementation, Examples

3 Application porting for distributed platforms

4 Problem: Application should be highly portable. Grid: schedulers (WMS, GridWay, Pilot Jobs...); libraries (DRMAA, SAGA,...). Cluster: schedulers (SGE, PBS/Torque,...); libraries (DRMAA, MPI).

5 A new standard? (author: xkcd.com)

6 Solution: distributedToolbox

7 High level description: Distributed tasks are defined with a reduced set of parameters (executable, arguments, input/output/error files) and exported as XML files. The XML files are parsed and the tasks executed on the distributed infrastructures. Depending on the infrastructure, this can be done in very different ways.
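A minimal sketch of such a task definition, written and read back with Python's standard library. The element names (`task`, `executable`, `arguments`, `input`, `output`, `error`) and both function names are assumptions made for illustration; the slides do not show the actual schema or API used by distributedToolbox.

```python
import xml.etree.ElementTree as ET


def task_to_xml(executable, arguments, input_files, output_files, error_file):
    """Serialize one task definition to an XML string (hypothetical schema)."""
    task = ET.Element("task")
    ET.SubElement(task, "executable").text = executable
    # Each list-valued parameter becomes a parent element with one child per item.
    for tag, item_tag, values in (
        ("arguments", "argument", arguments),
        ("input", "file", input_files),
        ("output", "file", output_files),
    ):
        parent = ET.SubElement(task, tag)
        for value in values:
            ET.SubElement(parent, item_tag).text = value
    ET.SubElement(task, "error").text = error_file
    return ET.tostring(task, encoding="unicode")


def task_from_xml(xml_text):
    """Parse a task definition produced by task_to_xml back into a dict."""
    root = ET.fromstring(xml_text)
    return {
        "executable": root.findtext("executable"),
        "arguments": [e.text for e in root.find("arguments")],
        "input_files": [e.text for e in root.find("input")],
        "output_files": [e.text for e in root.find("output")],
        "error_file": root.findtext("error"),
    }
```

Keeping the definition this small is what lets each infrastructure decide for itself how to run the task.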

8 High level description (2): The basic idea is NOT to define a new standard, libraries, API... BUT to create a simple specification that anyone can implement according to their specific needs. Extremely simple or rather complex!

9 Application developer’s point of view: Java and Python APIs are included to create distributed task definition files (XMLs) and to load information from XMLs. If needed, others can be seamlessly implemented.

10 DistributedToolbox: Set of tools to execute distributed tasks. Implementations for Cluster & Grid. Can be modified or adapted to new platforms in a very simple way.

11 Proposed solution for clusters

12 Proposed solution for the Grid

13 Execution workflow: The local application creates task XMLs. TaskLoader reads these files and stores them in a database. GridController reads this database and executes the tasks employing GridWay. A task is considered finished when the desired output files exist and are not empty. The local application loads the results and finishes its execution.
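The completion rule in this workflow can be sketched as a small check. The function name `is_finished` is hypothetical, and reading "not null" as "size greater than zero bytes" is an assumption.

```python
import os


def is_finished(output_files):
    """Return True when every expected output file exists and is non-empty.

    Note: with an empty list of expected files, all() is vacuously True.
    """
    return all(
        os.path.isfile(path) and os.path.getsize(path) > 0
        for path in output_files
    )
```

A check of this kind depends only on the files returned to the local host, so it does not need to trust the status reported by the remote site.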

14 Robustness. Certification problems: If the user cannot authenticate with a valid Grid certificate, GridWay detects it and aborts the task submission, notifying the problem. Communication failures: If any problem occurs in the transmission of the input data or the task executable, it is detected by GridWay on the remote site and the task is cancelled. If any problem occurs in the transmission of the output data and this data is not returned to the local host, the task is considered to have failed.

15 Robustness (2). Local resource failures: If the specified input files are not present on the system, the job is considered finished. If communication with GridWay is broken, task submission is stopped. When communication is restored, the status of the running tasks is checked. If GridController fails, no information is lost, because databases are used for persistence. When it is restarted, the previous state is recovered and the status of the tasks that were running is checked. If the database fails, the execution of GridController is considered unsafe and it stops automatically.
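The persistence idea can be illustrated with a minimal sketch using SQLite. The table layout, the state names (`PENDING`, `RUNNING`) and all function names are assumptions for illustration; the slides do not say which database or schema GridController uses.

```python
import sqlite3


def open_store(path):
    """Open (or create) the task database; existing rows survive a restart."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "id TEXT PRIMARY KEY, xml_path TEXT NOT NULL, state TEXT NOT NULL)"
    )
    db.commit()
    return db


def add_task(db, task_id, xml_path):
    """Register a task; re-adding a known id is a no-op."""
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT OR IGNORE INTO tasks VALUES (?, ?, 'PENDING')",
            (task_id, xml_path),
        )


def set_state(db, task_id, state):
    """Record a state change durably before acting on it."""
    with db:
        db.execute("UPDATE tasks SET state = ? WHERE id = ?", (state, task_id))


def tasks_to_recheck(db):
    """After a restart, list the tasks that were running and must be re-checked."""
    rows = db.execute("SELECT id FROM tasks WHERE state = 'RUNNING' ORDER BY id")
    return [row[0] for row in rows]
```

Because every state change is committed before the controller moves on, a crash leaves at worst a task whose recorded state is stale, which the re-check on restart is meant to catch.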

16 Robustness (3). Remote resource failures: If the remote task does not start, GridWay detects it. If the remote task remains in a queue for longer than a given threshold, it is resubmitted. If there is any problem with the Grid certificates on the remote site, it is detected by GridWay. Some failures in remote sites lead to a state where the master node thinks that the task is running even if it has finished on the worker node. To detect this, tasks with an extremely long execution time are considered to have failed. In order to avoid performance slowdowns, a small replication factor for every group of tasks has been included.
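The two time-based rules above can be sketched as a single decision function. The state names, the returned action labels and the threshold parameters are hypothetical; the slides do not give the actual values or how GridController represents them.

```python
def next_action(state, seconds_in_state, max_queue_seconds, max_run_seconds):
    """Decide what to do with a remote task based on how long it has been in its state."""
    # A task stuck in a remote queue beyond the threshold is resubmitted.
    if state == "QUEUED" and seconds_in_state > max_queue_seconds:
        return "resubmit"
    # A task that appears to run for an extremely long time is treated as failed,
    # which covers the case where the master node missed its completion.
    if state == "RUNNING" and seconds_in_state > max_run_seconds:
        return "mark_failed"
    return "wait"
```

For example, `next_action("QUEUED", 7200, 3600, 86400)` returns `"resubmit"`, while a task that has been running for one hour under the same thresholds is left alone.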

17 Use Cases

18 ProtTest3 [1] & jModelTest2 [2]: Java applications, designed to run on local workstations. Wrappers of a serial application, PhyML, that takes 99% of the computational effort. Large cases take days to weeks. Porting to HPC & Grid is necessary to improve throughput. [1] D. Darriba, G. L. Taboada, R. Doallo, and D. Posada. ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics, 2011. [2] D. Darriba, G. L. Taboada, R. Doallo, and D. Posada. jModelTest 2: more models, new heuristics and parallel computing. Nature Methods, 9(8):772, July 2012.

19 Architecture of the solution

20 Results: reliability tests. Tests for certificate management: submitting jobs with no certificate; submitting jobs with a certificate of a different VO; submitting jobs that finish after the certificate expires; manually destroying the certificate.

21 Results: reliability tests (2). Tests on local resource: killing GridWay, or any number of GridWay tasks; killing GridController; killing the database; shutting down the machine, both controlled and “hard reset”.

22 Results: reliability tests (3). Tests on remote sites: jobs not creating the desired output data; many tasks submitted to the fusion and Biomed VOs to test the proposal in production environments.

23 Results. Tasks executed: Cluster: about 10,000; Grid: more than 100,000. Not a single one was lost or malfunctioned.

24 Thanks for your attention Questions?
