How Parallelism Is Used In Bioinformatics Presented by: Laura L. Neureuter April 9, 2001 Using: Three Complementary Approaches to Parallelization of Local.


1 How Parallelism Is Used In Bioinformatics. Presented by: Laura L. Neureuter, April 9, 2001. Using: Three Complementary Approaches to Parallelization of Local BLAST Service on Workstation Clusters - Braun, Pedretti, Casavant, Scheetz, Birkett, Roberts

2 Overview 1. A Unique Approach to Information Gathering 2. Types of Architecture Used 3. Software Packages Used 4. How Parallelism Is Used in BLAST: a. Background b. Granularity c. Sequence to Sequence Comparison d. Parallelization of Single Query Across Partitioned Database e. Partitioned Set of Queries Across Set of Servers

3 A Unique Approach to Information Gathering ASK SOMEONE !!

4 Alejandro Shaffer Says… Parallel Computing is Used to Analyze: Protein Sequence Data, DNA Sequence Data, Protein Structure Data, Genetic Inheritance Data - Among Others

5 Alejandro Shaffer Says… Parallel Bioinformatics Computations are Run on the Following Architectures: Small Shared Memory Multiprocessor; Loosely Coupled Network of Processors

6 Alejandro Shaffer Says… The assembly of the Human Genome was done on a loosely coupled network of computers.

7 Alejandro Shaffer Says… Two Software Packages Used by Bioinformaticists That Run On Parallel Computers: BLAST, FASTLINK

8 BLAST Analyzes Protein or DNA Sequences: Takes input sequences and searches large databases for similar sequences.

9 FASTLINK Used to hunt for the approximate chromosomal location of disease-causing genes. - leaving this topic open for someone else to research.

10 BLAST Basic Local Alignment Search Tool The most common Sequence Comparison tool.

11 BLAST Three Parallel Components to BLAST 1. Sequence to Sequence Comparison Level 2. Parallelization of a single query across a distributed database 3. A set of queries is partitioned across a set of servers with either a replicated or partitioned database.

12 BLAST At the time the paper was published – December 15, 1999 – the only completed implementation was the third step: Parallelizing Batch Requests

13 First – Some Background “The basic nature of the entire process of gene discovery is highly parallel, heterogenous, and distributed.”

14 Background At the time this paper was published, the mode used by 90% of researchers was to submit single queries for comparison of sequence data (300-600 chars) against one or more databases (e.g. GenBank).

15 Background The paper predicted that once the human genome was finished, the frequency and intensity of inquiries against the data would increase exponentially. We’ve all seen the graph (several times) that proves this is true.

16 Background Problems: 1) Cluster of servers continues to diminish in its ability to serve the increasing number of requests. 2) Network traffic is becoming intolerable. 3) Database is growing at an increasing rate. 4) Single queries are time consuming.

17 Refresher … Granularity Defined as the size of the computation between communication or synchronization points. Coarse – Each process contains a large number of sequential instructions and takes a substantial time to execute. Fine – Each process consists of a few instructions, or even one. Medium – Middle ground.

18 Refresher … Granularity Granularity is related to the number of processors being used. Metric: Computation/Communication ratio = tcomp/tcomm. It is important to maximize this ratio while maintaining parallelism.
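
As a rough illustration of the tcomp/tcomm metric, the sketch below computes the ratio for two made-up workloads. The function name and all timings are assumptions chosen for illustration; none of the numbers come from the paper.

```python
def comp_comm_ratio(t_comp: float, t_comm: float) -> float:
    """Return the computation/communication ratio t_comp / t_comm."""
    if t_comm <= 0:
        raise ValueError("t_comm must be positive")
    return t_comp / t_comm


# Hypothetical timings in seconds (illustrative only, not measured values).
# Coarse grained: a whole query is processed between messages.
coarse = comp_comm_ratio(t_comp=120.0, t_comm=0.5)
# Fine grained: only a tiny piece of work is done between messages.
fine = comp_comm_ratio(t_comp=0.002, t_comm=0.001)

print(f"coarse-grained ratio: {coarse:.1f}")
print(f"fine-grained ratio:   {fine:.1f}")
```

A higher ratio means less time is lost to communication, which is why the coarse-grained workload tolerates a slow network far better than the fine-grained one.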

19 Three Levels of Parallelism Exploitable in BLAST
Fine Grained - Subject(s): 1 sequence. Target(s): 1 sequence. Parallelism: multiple alignments on single sequence pairs.
Medium Grained - Subject(s): 1 sequence. Target(s): M sequences (in database). Parallelism: partition database; multiple targets examined at once.
Coarse Grained - Subject(s): N sequences (batch request). Target(s): M sequences (in database). Parallelism: replicate database; partition input sets.

20 BLAST BLAST is a heuristic search algorithm Heuristic: Process of elimination and compromise by using the “what if” theory. An educated guess that reduces or limits the search for solutions. A method of solving problems by intelligent trial and error.

21 BLAST Five variations of BLAST: blastn, blastx, tblastx, blastp, tblastn

22 BLAST blastn Compares a nucleotide sequence against a nucleotide database (Relatively quick)

23 BLAST blastx Compares a nucleotide sequence against a protein database. The nucleotide “subject” needs to be translated into a peptide sequence. Since there are 6 possible translations (reading frames), the basic BLAST algorithm must be applied 6 times.

24 BLAST tblastx Compares a nucleotide sequence to a nucleotide database, but each is translated (in all 6 reading frames) into a peptide sequence before blasting. This is the most computationally intensive BLAST algorithm: it must be invoked 36 times for each sequence to sequence comparison.

25 BLAST blastp Compares a peptide sequence to a peptide database (Relatively quick)

26 BLAST tblastn Compares a peptide sequence against a nucleotide database Requires 6 calls to BLAST
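
The counts of 6 and 36 on the slides above follow from the six reading frames of a nucleotide sequence: three offsets on each of the two strands. The sketch below enumerates those frames and derives the number of core comparisons per program. The names `reading_frames`, `FRAMES`, and `core_invocations` are hypothetical helpers written for this illustration and are not part of any BLAST distribution; codon-to-amino-acid translation is omitted to keep the example short.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def reading_frames(seq: str) -> list:
    """Return the six codon-aligned reading frames (3 forward, 3 reverse)."""
    frames = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            usable = len(strand) - offset
            usable -= usable % 3  # drop a trailing partial codon
            frames.append(strand[offset:offset + usable])
    return frames


# Frames examined on (query side, database side) for each program,
# following the descriptions on the slides.
FRAMES = {
    "blastn": (1, 1),
    "blastp": (1, 1),
    "blastx": (6, 1),
    "tblastn": (1, 6),
    "tblastx": (6, 6),
}


def core_invocations(program: str) -> int:
    """Number of basic comparisons per sequence pair for a program."""
    query_frames, db_frames = FRAMES[program]
    return query_frames * db_frames


for name in FRAMES:
    print(name, core_invocations(name))
```

Because each of these comparisons is independent of the others, they are natural candidates for the fine-grained parallelism described earlier.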

27 BLAST Benefits of Parallelizing Local BLAST Reduces processing time in relation to number of compute nodes utilized. Reduces costs by utilizing commodity workstations and PCs. A locally-scheduled parallel algorithm allows prioritization and control over individual searches.

28 Types of Parallelism I. Pairwise Multiple Alignment A fancy term for the earlier description of variations on the BLAST algorithm. Since the comparisons are mutually independent, the parallelization of the comparisons is potentially very efficient. Of greatest importance would be a high-speed, low-latency interconnection network to allow rapid selection and scoring of the best possible alignment. Effective implementation would greatly benefit from specialized hardware.

29 Types of Parallelism II. Database Partitioning Distributing chunks of the database across a collection of compute nodes. Master node coordinates the scheduling of jobs and collates the results from each submission.
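
A minimal sketch of the database-partitioning idea, under several assumptions: worker threads stand in for compute nodes, the database is a small in-memory list, and `toy_score` (shared k-mer count) is a stand-in for the real BLAST comparison. All names here are hypothetical and the code is not taken from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(database, n_nodes):
    """Split the database into n_nodes chunks (round-robin assignment)."""
    return [database[i::n_nodes] for i in range(n_nodes)]


def kmers(seq, k=3):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_score(query, target):
    """Stand-in for BLAST: number of distinct 3-mers the sequences share."""
    return len(kmers(query) & kmers(target))


def search_chunk(query, chunk):
    """Work done by one compute node on its own chunk of the database."""
    return [(name, toy_score(query, seq)) for name, seq in chunk]


def master_search(query, database, n_nodes=2):
    """Master node: farm the query out to every chunk, then collate."""
    chunks = partition(database, n_nodes)
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partial = list(pool.map(lambda c: search_chunk(query, c), chunks))
    hits = [hit for part in partial for hit in part]
    # Best score first; ties broken by name so the order is deterministic.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


DATABASE = [
    ("seqA", "ACGTACGT"),
    ("seqB", "TTTTTTTT"),
    ("seqC", "ACGTTTTT"),
    ("seqD", "GGGGCCCC"),
]

for name, score in master_search("ACGTAC", DATABASE, n_nodes=2):
    print(name, score)
```

The collated result is the same regardless of how many chunks are used, which is the property the master node relies on when it merges per-node output.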

30 Types of Parallelism III. Batch Mode Scheduling sets of queries, while keeping full copies of the database stored on each compute node. This type of parallelism is currently in place and being used.
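
Batch mode can be pictured as a shared queue of queries drained by nodes that each hold a full database copy. The sketch below is one possible arrangement, not the paper's implementation: threads stand in for compute nodes, and the `search` callable is a placeholder for a local BLAST run.

```python
import queue
import threading


def run_batch(queries, search, n_nodes=3):
    """Distribute queries over n_nodes workers, each with a full DB copy.

    `search` is a placeholder for running one query against the local
    database copy; here it can be any function of the query string.
    """
    pending = queue.Queue()
    for q in queries:
        pending.put(q)

    results = {}
    lock = threading.Lock()

    def node(node_id):
        # Each node pulls the next unprocessed query until none remain.
        while True:
            try:
                q = pending.get_nowait()
            except queue.Empty:
                return
            hit = search(q)
            with lock:
                results[q] = (node_id, hit)

    workers = [threading.Thread(target=node, args=(i,)) for i in range(n_nodes)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


# Usage with a trivial stand-in "search" that returns the query length.
batch = ["AAA", "CCCC", "GG", "TTTTT"]
for q, (node_id, hit) in sorted(run_batch(batch, search=len, n_nodes=2).items()):
    print(q, "-> node", node_id, "result", hit)
```

Which node handles which query is not deterministic here; only the set of results is. That is acceptable because the queries are independent of one another.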

31 Batch Mode The foundation of the local batch BLAST system is the Portable Batch System (PBS) developed for NASA. PBS is composed of three parts: the Job Server, the Scheduler, and the Compute Nodes.

32 Batch Mode The Job Server Responsible for managing two queues of incoming jobs – one for batch blast jobs, the other for jobs interactively submitted to local BLAST through a web interface.

33 Batch Mode The Scheduler Applies job scheduling algorithm to allocate compute nodes to jobs in the two incoming job queues. Some nodes have several CPUs and can handle more than one simultaneous blast job. The scheduler assigns multiple jobs to such nodes.

34 Batch Mode Compute Nodes Each node has a monitor that communicates with the job server. Each node has its own set of sequence databases.

35 Job Types 1) Batch jobs – can be executed at any time and restarted if necessary. 2) Interactive jobs – time critical and should have priority over batch jobs.

36 Job Types At the time the paper was published, the implementation was as follows: 75% of compute nodes execute batch jobs; 25% are always available for interactive web jobs. If there are no batch jobs, all 100% are available for web jobs. Neither type of job will be starved of resources with this approach.
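
The 75/25 split can be sketched as a small allocation function. This is an illustration only: it assumes one job per node, and it generalizes the slide's rule by letting web jobs use any node that batch jobs leave idle, which the slide states only for the case of no batch jobs at all. The function and parameter names are hypothetical.

```python
def allocate(n_nodes, batch_jobs, web_jobs, reserved_fraction=0.25):
    """Return (nodes given to batch jobs, nodes given to web jobs).

    Assumptions (not from the paper): one job per node, and any node
    not needed by batch work may be used by interactive web jobs.
    """
    reserved = max(1, round(n_nodes * reserved_fraction))
    batch_cap = n_nodes - reserved          # batch never takes reserved nodes
    batch_nodes = min(batch_jobs, batch_cap)
    web_nodes = min(web_jobs, n_nodes - batch_nodes)
    return batch_nodes, web_nodes


# Heavy batch load: batch is capped at 75% of a 16-node cluster.
print(allocate(16, batch_jobs=100, web_jobs=10))
# No batch jobs: every node is available for web jobs.
print(allocate(16, batch_jobs=0, web_jobs=20))
```

Capping batch work below the full cluster is what keeps interactive jobs from starving, while the fallback to all nodes avoids leaving capacity idle when the batch queue is empty.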

37 Issues with Batch Mode All replicated databases must be updated periodically to reflect the most recent contents of the globally shared database. All nodes' copies must be consistent with one another. Otherwise, the results of a query would depend on which compute node processed it.
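
One common way to detect inconsistent replicas is to compare a checksum of each node's copy against a reference copy. The slides do not say how consistency was checked, so the sketch below is an assumption: in-memory byte strings stand in for database files, and `stale_nodes` is a hypothetical helper.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a database image."""
    return hashlib.sha256(data).hexdigest()


def stale_nodes(reference: bytes, replicas: dict) -> list:
    """Return the names of nodes whose copy differs from the reference."""
    ref = digest(reference)
    return sorted(name for name, copy in replicas.items() if digest(copy) != ref)


# Toy data: node2 is missing the most recent entry.
master_copy = b">seq1\nACGT\n>seq2\nGGCC\n"
replicas = {
    "node1": b">seq1\nACGT\n>seq2\nGGCC\n",
    "node2": b">seq1\nACGT\n",
    "node3": b">seq1\nACGT\n>seq2\nGGCC\n",
}

print(stale_nodes(master_copy, replicas))
```

A scheduler could hold back jobs from any node reported as stale until its copy has been refreshed.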

38 Considerations… A Networked File System is being considered where there would be several I/O servers in a system, each with a complete copy of database. Compute nodes would rely on these I/O servers for access to database.

39 Next Step… The partitioned database implementation will utilize many of the concepts developed for the coarse-grained implementation, but the scheduler would need to know which nodes had which section of the database.

40 Next Step… Outputs would then need to be combined into a single output file. This is non-trivial: the merge program must parse, sort, and correct data from the nodes, and E values must be corrected to reflect the larger database size.
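
The E-value correction can be sketched under a simplifying assumption: that an E value scales linearly with the size of the database searched, so a hit found in one partition is rescaled by (total size / partition size). Real BLAST statistics use effective lengths with edge corrections, which this sketch ignores. The function name and the toy numbers are illustrative, not from the paper.

```python
def merge_hits(partial_results, partition_sizes):
    """Merge per-node hit lists and rescale E values to the full database.

    partial_results: {node: [(hit_name, e_value), ...]}
    partition_sizes: {node: size of that node's partition (e.g. residues)}
    Assumes E is proportional to database size (a simplification).
    """
    total = sum(partition_sizes.values())
    merged = []
    for node, hits in partial_results.items():
        scale = total / partition_sizes[node]
        for name, e_value in hits:
            merged.append((name, e_value * scale))
    # Most significant (smallest corrected E value) first.
    return sorted(merged, key=lambda h: h[1])


# Toy example: two partitions of different sizes.
sizes = {"node1": 1000, "node2": 3000}
partials = {
    "node1": [("hitA", 1e-5)],
    "node2": [("hitB", 3e-6)],
}

for name, e_value in merge_hits(partials, sizes):
    print(name, e_value)
```

Without the rescaling, a hit from a small partition would look more significant than it is relative to the whole database, and the merged ranking could be wrong.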

41 Questions ??? One of My Own Since this paper was published in 1999, have all three levels of parallelism described here been exploited by now? - haven’t found the answer.

