
1 Large-Scale Profile-HMM on the Grid
Laurent Falquet, Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland
Borrowed from Heinz Stockinger, June 17, 2006

2 Outline
Laurent.Falquet@isb-sib.ch
- Computing-intensive sequence alignment
- Mapping the problem to the Grid
- Prototype implementation and results

3 Introduction
- Sequence alignment based on profile-HMMs is popular but CPU intensive
- The problem can easily be parallelised:
  - Embarrassingly parallel problem domain
  - Ideal for a Data Grid with lots of CPU power

4 HMM in Biology
- HMMs were originally used mainly in speech recognition
- In biology they are used for sequence alignment and database search
- Packages include HMMER, PFTOOLS, SAM, etc.
- Profile-HMMs are stored in databases such as Prosite, PFAM, SMART, etc.

5 HMM on the Grid
- Gridification of the scenario:
  1. Input dataset needs to be split (pre-processing)
  2. Workload generation
  3. Grid job submission
  4. Remote execution (CPU-intensive part)
  5. Merging of results
Example: hmmsearch model.hmm database.seq
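Steps 1 and 5 of the scenario above can be sketched in a few lines of Python. This is only an illustration, not the prototype's code: it assumes the sequence database is in FASTA format, and the function names (`read_fasta_records`, `split_records`, `merge_results`) are hypothetical.

```python
from typing import Iterable, Iterator, List


def read_fasta_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one FASTA record (header line plus sequence lines) at a time."""
    record: List[str] = []
    for line in lines:
        # A new header closes the record collected so far.
        if line.startswith(">") and record:
            yield "".join(record)
            record = []
        if line.strip():
            record.append(line if line.endswith("\n") else line + "\n")
    if record:
        yield "".join(record)


def split_records(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Step 1: group records into chunks of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk: List[str] = []
    for rec in records:
        chunk.append(rec)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def merge_results(partial_outputs: Iterable[str]) -> str:
    """Step 5 (naive assumption): concatenate per-job outputs in order."""
    return "".join(partial_outputs)
```

Plain concatenation is the simplest possible merge. Reported E-values depend on the size of the searched database, so a real merge of searches over a split sequence database would probably need more care than this sketch shows.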

6 HMM on the Grid
[Diagram: workflow between the local desktop and the Grid]
- Local desktop:
  1. Data pre-processing
  2. Creation of job descriptors
  3. Job submission
  5. Merging of results
- Grid: Grid Storage Element (SE); 4. Remote execution on computing elements (CE)

7 [Diagram: splitting the input files into chunks]
- Input files: Profiles 1..nnn and Seqs 1..zzzzz
- Profile chunks: Profiles 1..100, Profiles 101..200, ..., Profiles kkk..nnn
- Sequence chunks: Seqs 1..10’000, Seqs 10’001..20’000, ..., Seqs yyyyy..zzzzz
- wg on the local site: store files on the Grid Storage Element (SE)
- wg on the remote site: get files from the SE (e.g. Profiles 1..100 and Seqs 1..10’000), then run hmmsearch for Profile 1, Profile 2, Profile 3, ..., Profile 100
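The chunking in the diagram can be read as pairing every profile chunk with every sequence chunk. The sketch below shows that reading only; whether the prototype builds its work units exactly this way is an assumption, and `chunk_ranges` and `work_units` are hypothetical names.

```python
from itertools import product
from typing import List, Tuple

Range = Tuple[int, int]  # inclusive, 1-based (first, last) entry numbers


def chunk_ranges(total: int, chunk_size: int) -> List[Range]:
    """Split entries 1..total into consecutive ranges of at most chunk_size."""
    if total < 0 or chunk_size < 1:
        raise ValueError("total must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total))
        for start in range(1, total + 1, chunk_size)
    ]


def work_units(
    n_profiles: int,
    n_seqs: int,
    profiles_per_chunk: int,
    seqs_per_chunk: int,
) -> List[Tuple[Range, Range]]:
    """Pair every profile chunk with every sequence chunk (assumed scheme)."""
    return list(
        product(
            chunk_ranges(n_profiles, profiles_per_chunk),
            chunk_ranges(n_seqs, seqs_per_chunk),
        )
    )


if __name__ == "__main__":
    # Small made-up sizes, using the chunk sizes shown in the diagram.
    for profiles, seqs in work_units(250, 15_000, 100, 10_000):
        print(f"Profiles {profiles[0]}..{profiles[1]} x Seqs {seqs[0]}..{seqs[1]}")
```

Each resulting pair is an independent unit of work, which is what makes the problem embarrassingly parallel.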

8 Prototype Implementation
- Prototype code implemented in C++ to wrap hmmsearch as well as pfsearch
  - Other applications are possible, too
- Client-side code:
  - Uses EGEE as the main Grid middleware
  - Within Swiss BioGrid: ARC/NorduGrid
  - The tool also runs on LSF (Vital-IT cluster)
  - Globus needs to be installed on the local machine
  - Takes care of job creation, remote execution, resubmission, etc.

9 Usage Example for “wg” (workload generator)
Usage: wg [options]
  path of the profile-HMM file containing 1 or more profiles.
  path of the sequence file containing 1 or more sequences.
The following options are available:
  -p  number of parallel jobs submitted to the Grid.
  -r  used for a remote execution of the program.
  -h  print this help message.
  -s  retrieve all status information of submitted jobs.
  -v  verbose: print debug messages.
  -O  do not retrieve the job output but only display status.
  -P  use pfsearch (default).
  -H  use hmmsearch.

10 Benchmarks
- Several benchmarks to check for:
  - Functionality (correctness)
  - Performance
  - Performance “prediction”
- Preliminary benchmark dataset:
  - Profile-HMM DB with 7,868 entries (619 MB)
  - Sequence DB with 10,923 entries (7.4 MB)

11 Preliminary Results
[Plot: hmmsearch execution time in hours versus number of processors (parallel jobs)]

12 Distribution of Job Execution Time
[Plot: 100 parallel jobs]
- Not really “high performant”
- Need to get rid of peaks

13 Run Time Sensitive (RTS) Scheduling/Execution
[Diagram: Task Server, Storage Element holding task 1, task 2, ..., task n, and a Worker Node with a running job]
1. Get task
2. Return task URL
3. Retrieve task
Task done
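The diagram suggests that a running job pulls its next task from the task server whenever it is idle, rather than receiving a fixed share up front. The simulation below is a rough sketch of why that helps on heterogeneous nodes. It is not the authors' algorithm: the round-robin baseline, the speed model and the function names are all assumptions made for illustration.

```python
import heapq
from typing import Sequence


def static_makespan(task_costs: Sequence[float], worker_speeds: Sequence[float]) -> float:
    """Baseline: tasks are assigned round-robin before execution starts."""
    if not worker_speeds:
        raise ValueError("at least one worker is required")
    finish = [0.0] * len(worker_speeds)
    for i, cost in enumerate(task_costs):
        w = i % len(worker_speeds)
        finish[w] += cost / worker_speeds[w]
    return max(finish)


def pull_makespan(task_costs: Sequence[float], worker_speeds: Sequence[float]) -> float:
    """Pull model: whichever worker becomes idle first fetches the next task."""
    if not worker_speeds:
        raise ValueError("at least one worker is required")
    # Heap of (time at which the worker is free, worker index).
    free_at = [(0.0, w) for w in range(len(worker_speeds))]
    heapq.heapify(free_at)
    makespan = 0.0
    for cost in task_costs:
        now, w = heapq.heappop(free_at)
        done = now + cost / worker_speeds[w]
        makespan = max(makespan, done)
        heapq.heappush(free_at, (done, w))
    return makespan


if __name__ == "__main__":
    costs = [1.0] * 8    # eight equally sized tasks (made-up numbers)
    speeds = [1.0, 0.25]  # one fast node and one node four times slower
    print("static:", static_makespan(costs, speeds))
    print("pull:  ", pull_makespan(costs, speeds))
```

In this toy setting the slow node ends up with fewer tasks under the pull model, so a single slow node delays the whole run less than it does with a fixed assignment.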

14 Execution time of RTS-Algo.
[Plot: time in hours versus sequence number of work unit (0 to 400)]
- Overall performance: 2.5 hours
- Similar to performance on local cluster

15 Execution time on LSF
[Plot: marks at 1 hour, 2 hours, 3 hours]

16 Comparison

17 Conclusion
- Gridification works well for the selected problem
- High performance is achieved via a run-time-sensitive scheduling algorithm
- Heterogeneous Grids can achieve performance comparable to homogeneous clusters

The EMBRACE project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health", contract number LHSG-CT-2004-512092.

Work done by Heinz Stockinger in co-operation with Marco Pagni, Lorenzo Cerutti and Laurent Falquet

