Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases Vinay D. Shet CMSC 838 Presentation Authors: Allison Waugh, Glenn.

1 Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases
Vinay D. Shet, CMSC 838 Presentation
Authors: Allison Waugh, Glenn A. Williams, Liping Wei, and Russ B. Altman

2 CMSC 838T – Presentation: Motivation
- Biological databases are growing at a very high rate
  - Protein Data Bank (PDB) increased from 5811 entries to 12110 in three years
- Computational tools are required to efficiently access and analyze this data
  - Typical data analyses
    - Linear scans across the database looking for something
    - "All-versus-all" comparisons within the database
- High-performance distributed computing resources can play an important role in these analyses
  - Authors use a distributed computing environment, LEGION, to enable large-scale analysis of the PDB
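To make the difference between the two typical analyses concrete, the following Python sketch counts the units of work each one generates for a database of n entries. The function names are illustrative only, and counting each unordered pair once is an assumption about what "all-versus-all" means here.

```python
def linear_scan_jobs(n_entries: int) -> int:
    """One unit of work per database entry."""
    return n_entries


def all_versus_all_jobs(n_entries: int) -> int:
    """One unit of work per unordered pair of distinct entries (assumed)."""
    return n_entries * (n_entries - 1) // 2


for n in (5811, 12110):  # PDB sizes quoted on this slide
    print(n, linear_scan_jobs(n), all_versus_all_jobs(n))
```

A linear scan grows in step with the database, while the pairwise count grows roughly with its square, which is why the second kind of analysis benefits most from distributed resources.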

3 Motivation
- Similar to evaluation of the threaded-BLAST project
  - We run threaded BLAST on a Sun SMP with 24 processors
- Authors run a program called FEATURE over the LEGION framework
  - Can access hundreds of CPUs worldwide
  - Can spawn sequential versions of FEATURE on all of them

4 Talk Overview
- Overview of talk
  - Motivation
  - Background
    - LEGION
    - FEATURE
  - Methods
    - Experiments
  - Results
  - Discussion
  - Related work
  - Observations

5 Background
- LEGION (Worldwide Virtual Computer)
  - Metacomputing environment composed of geographically distributed, heterogeneous collections of workstations and supercomputers
  - Connects resources to make up a single, worldwide virtual computer
  - Coordinates a large number of parallel jobs on a mixture of processors (SMPs, MPPs, PCs) on any network
  - Provides the software infrastructure so that a system of heterogeneous, geographically distributed, high-performance machines can interact seamlessly
  - No manual installation of binaries over multiple platforms (LEGION does it automatically)

6 Background
- LEGION
  - LAM: an MPI implementation for workstation clusters
  - LEGION supports transparent scheduling, data management, fault tolerance, site autonomy, a single file name space, efficient scheduling, comprehensive resource management, and a wide range of security options

7 Background
- FEATURE
  - Site characterization and recognition system
    - A site is a microenvironment distinguished by some structural or functional role
  - Identifies functional or structural sites of interest in a query protein

8 Background
- FEATURE
  - Measures spatial distributions of chemical and physical properties to create a statistical model of a microenvironment
  - Compares regions of the query protein with known sites and control non-sites, and assigns scores indicating the likelihood that a region is a site
  - Produces a list of potential site locations with corresponding scores
  - Has been used to recognize ion, ligand, and enzyme binding sites
  - FEATURE is a typical data-driven algorithm requiring large data storage and efficient data analysis
  - Requires 12 hours on a single processor to evaluate 580 non-redundant PDB entries
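As a rough worked example of why a single processor is not enough, the sketch below extrapolates the quoted 12 hours for 580 entries to the 10,996 proteins scanned in the comprehensive experiment. It assumes, purely for illustration, that every entry costs the same on average; real entries vary in size, so treat the figures as order-of-magnitude estimates only.

```python
HOURS_FOR_SAMPLE = 12.0   # quoted on the slide
SAMPLE_ENTRIES = 580      # quoted on the slide


def estimated_hours(n_entries: int) -> float:
    """Linear extrapolation; assumes a constant average cost per entry."""
    return HOURS_FOR_SAMPLE / SAMPLE_ENTRIES * n_entries


seconds_per_entry = HOURS_FOR_SAMPLE * 3600 / SAMPLE_ENTRIES
print(f"{seconds_per_entry:.1f} s per entry")
print(f"{estimated_hours(10_996):.1f} h for 10,996 entries")
```

Under that assumption, a full sequential scan would take on the order of nine days, which motivates spreading the work over many processors.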

9 Methods
- FEATURE was run on all protein entries in the May 2000 PDB
- Searched for potential calcium binding sites
  - FEATURE has 90% sensitivity and 100% specificity for this task
- Three experiments conducted
  - Sequential scan of a PDB subset using a single processor
  - Comprehensive scan of the PDB on the LEGION system using 50 processors
  - Set of LEGION runs using a constant PDB subset but a varying number of processors
- Input parameters to FEATURE and the statistical model for Ca remained constant
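For readers less familiar with the two quoted metrics, here is a minimal sketch of their standard definitions. The counts in the example are hypothetical and chosen only to reproduce the 90% and 100% figures; they are not data from the paper.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real sites that are recognized as sites."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-sites that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, for illustration only.
print(sensitivity(true_pos=9, false_neg=1))    # 0.9
print(specificity(true_neg=50, false_pos=0))   # 1.0
```

In other words, 100% specificity means no non-site is reported as a calcium binding site, while 90% sensitivity means about one real site in ten is missed.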

10 Methods
- Experiments
  - Sequentially scanned an arbitrary 726 proteins from the PDB
    - Runs made on a single-processor Sun E450 machine with a 300 MHz UltraSPARC CPU
  - Comprehensive scan of all proteins (10,996 total) in the PDB
    - Maximum number of processors: 50
    - FEATURE code compiled for various platforms so binaries can be run on different machines across LEGION
  - Scanned a subset of proteins with a varying number of processors
    - Arbitrarily selected 4997 proteins for each run
    - Varied the number of processors using values 20, 40, 60, and 80
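The third experiment amounts to running many independent per-protein scans on a worker pool of varying size. The sketch below shows that pattern with Python's standard `concurrent.futures` module. It is not LEGION code: `scan_protein` is a hypothetical stand-in for one sequential FEATURE run, and the identifiers are made up.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_protein(pdb_id: str) -> tuple[str, int]:
    """Hypothetical stand-in for one sequential FEATURE run on one entry."""
    return pdb_id, len(pdb_id)  # placeholder result, not a real score


def scan_all(pdb_ids: list[str], n_workers: int) -> dict[str, int]:
    """Run the independent per-protein scans on a pool of n_workers."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(scan_protein, pdb_ids))


ids = [f"P{i:04d}" for i in range(20)]  # made-up identifiers
for workers in (2, 4, 8):
    results = scan_all(ids, workers)
    print(workers, len(results))
```

Because each protein is scanned independently, no communication between workers is needed; only the pool size changes between runs, which mirrors the 20/40/60/80 processor sweep.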

11 Results
- FEATURE reported six run-time failures due to non-standard PDB file formats during the sequential run
- FEATURE also encountered run-time assertion failures, illegal instructions, or segmentation faults during the second experiment

12 Results

13 Discussion
- FEATURE performance deteriorates once the number of processors exceeds 60
  - The optimal maximum number is constrained by
    - the client's process table, which keeps track of each LEGION process spawned
    - the amount of memory available to support spawned processes
  - Thus, even if LEGION contains hundreds of nodes, users cannot use all of them
- LEGION also provides minimal fault tolerance (if any instance fails, the user must wait until everything has finished to re-spawn it)
- Authors maintained a local copy of the database but concede that this is not a realistic situation, because
  - updates to the PDB occur frequently
  - it consumes a lot of disk space
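The fault-tolerance limitation described above (failed instances can only be re-spawned after the whole batch finishes) can be pictured with a small sketch. This is a generic illustration in plain Python, not LEGION's mechanism; `flaky_worker` and the task names are hypothetical.

```python
def run_batch(tasks, worker):
    """Run every task; collect failures instead of stopping at the first."""
    results, failed = {}, []
    for task in tasks:
        try:
            results[task] = worker(task)
        except Exception:
            failed.append(task)
    return results, failed


def run_with_respawn(tasks, worker, max_rounds=3):
    """Re-run failed tasks only after the whole batch has finished."""
    results, pending = {}, list(tasks)
    for _ in range(max_rounds):
        done, pending = run_batch(pending, worker)
        results.update(done)
        if not pending:
            break
    return results, pending


attempts = {}


def flaky_worker(task):
    """Hypothetical worker: one task fails once, another always fails."""
    attempts[task] = attempts.get(task, 0) + 1
    if task == "bad_format":
        raise ValueError("non-standard input")
    if task == "flaky" and attempts[task] == 1:
        raise RuntimeError("transient failure")
    return task.upper()


results, still_failed = run_with_respawn(["ok", "flaky", "bad_format"], flaky_worker)
print(results, still_failed)
```

A transient failure is recovered in a later round, while a task that fails deterministically (for example, because of a malformed input file) stays in the failed list no matter how often it is re-spawned.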

14 Related Work
- Threaded BLAST and MPI BLAST
  - The authors' work is similar to threaded BLAST
  - MPI BLAST is a parallelized version of BLAST, so a single query can be split across multiple processors
  - FEATURE is not truly parallelized

15 Observations
- Running CPU-intensive tasks over many processors is definitely useful
  - However, LEGION does not scale well, as performance degrades beyond 60 processors
- The authors have not utilized true parallelism in FEATURE
  - It seems to me that there is a lot of potential to parallelize FEATURE, given that many potential sites can be examined simultaneously
  - What would the performance enhancement be in a parallelized version?
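One standard way to reason about the last question is Amdahl's law, which bounds the speedup when only part of a program's run time can be parallelized. The sketch below is a back-of-the-envelope aid, not a measurement: the 95% parallel fraction is a made-up value, since the slides give no such figure for FEATURE.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Ideal speedup when only part of the run time can be parallelized."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


# Assumed parallel fraction of 0.95, for illustration only.
for p in (20, 40, 60, 80):
    print(p, round(amdahl_speedup(0.95, p), 2))
```

Even in this idealized model the speedup flattens as processors are added, so the answer depends heavily on how much of FEATURE's per-protein work is genuinely independent across candidate sites.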

16 Questions

