Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST Larissa Smelkov.


1 Parallel Computation in Biological Sequence Analysis: ParAlign & TurboBLAST Larissa Smelkov

2 Biological Sequence Alignment
Global alignment
  Goal: to identify conserved regions and differences
  Algorithm: Needleman-Wunsch
  Application: comparing two genes with the same function (human vs. mouse); comparing two proteins with similar function
Local alignment
  Goal: to see whether 2 strings have a common substring
  Algorithm: Smith-Waterman
  Application: searching for local similarities in large sequences (newly sequenced genomes); looking for motifs in 2 proteins

3 Protein Responsible for Iron Transport
Human
MQEYTNHSDTTFALRNISFRVPGRTLLHPLSLTFPAGKVTGLIGHNGSGKSTLLK
MLGRHPPSEGEILLDAQPLESWSSKAFARKVAYLPQQLPPAEGMTVRELVAIGR
YPWHGALGRFGAADREKVEEAISLVGLKPLAHRLVDSLSGGERQRAWIAMLVA
QDSRCLLLDEPTSALDIHQVDVLSLVHRLSQERGLTVIAVLHDINMAARYCDYL
VALRGGEMIAQGTPAEIMRGETLEMIYGIPMGILPHPAGAAPVSFVY
Chicken
MKLILCTVLSLGIAAVCFAAPPKSVIRWCTISSPEEKKCNNLRDLTQQERISLTCV
QKATYLDCIKAIANNEADAISLDGGQVFEAGLAPYKLKPIAAEIYEHTEGSTTSY
YAVAVVKKGTEFTVNDLQGKTSCHTGLGRSAGWNIPIGTLLHWGAIEWEGIESG
SVEQAVAKFFSASCVPGATIEQKLCRQCKGDPKTKCARNAPYSGYSGAFHCLKD
GKGDVAFVKHTTVNENAPDLNDEYELLCLDGSRQPVDNYKTCNWARVAAHAV
VARDDNKVEDIWSFLSKAQSDFGVDTKSDFHLFGPPGKKDPVLKDLLFKDSAIM
LKRVPSLMSQLYLGFEYYSAIQSMRKDQLSGSPRQNRIQWIAVLKAEKSKCDRW
SVVSNGDVECTVVDETKDCIIKIMKGEADAV

5 Similar Substrings DSLSGGERQ–RA–WIAMLVAQDSRC : : : : : : : DQLSGSPRQNRIQWIAVLKAEKSKC

6 Talk Outline: Problem Description; Smith-Waterman Algorithm; BLAST; ParAlign; TurboBLAST; Comparison

7 Problems of Comparison of 2 Sequences Evolution factor: additions, deletions, substitutions. Human factor: typos, duplicates.

8 Solution Smith-Waterman Algorithm (S-W), which uses a score matrix and a gap penalty.

9 Score Matrix: BLOSUM45

10 Pairwise Alignment Example: ELEPHANT vs. PANTHER

11 S-W: Dynamic Programming Matrix

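12 S-W: Formula
$$T[i,j] = \max \begin{cases} T[i-1,j-1] + \mathrm{score}(s[i], t[j]) \\ T[i-1,j] - g \\ T[i,j-1] - g \\ 0 \end{cases}$$
where g is the gap penalty (g = 8 in our example). The unknown cell T[i, j] is computed from its three neighbors T[i-1, j-1], T[i-1, j] and T[i, j-1].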
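The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `smith_waterman` and the placeholder `simple_score` (match +5, mismatch -4) are assumptions that stand in for a real BLOSUM45 lookup, and the gap penalty defaults to g = 8 as in the example.

```python
def smith_waterman(s, t, score, g=8):
    """Fill the S-W matrix T and trace back one best local alignment."""
    m, n = len(s), len(t)
    # Row 0 and column 0 stay at zero (empty prefixes).
    T = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            T[i][j] = max(
                T[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # align s[i] with t[j]
                T[i - 1][j] - g,                              # s[i] against a gap
                T[i][j - 1] - g,                              # t[j] against a gap
                0,                                            # start a new local alignment
            )
            if T[i][j] > best:
                best, best_pos = T[i][j], (i, j)

    # Trace back from the highest cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and T[i][j] > 0:
        if T[i][j] == T[i - 1][j - 1] + score(s[i - 1], t[j - 1]):
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif T[i][j] == T[i - 1][j] - g:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


def simple_score(a, b):
    # Placeholder scoring (an assumption), standing in for a BLOSUM45 lookup.
    return 5 if a == b else -4


result = smith_waterman("ELEPHANT", "PANTHER", simple_score)
gapped = smith_waterman("ACGT", "ACT", simple_score, g=1)
```

With this placeholder scoring the best local alignment of ELEPHANT and PANTHER is the ungapped ANT/ANT; a real substitution matrix, which rewards some matches more than others, can make a gapped alignment score higher.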

13 S-W: Dynamic Programming Matrix

17 S-W: Result Alignment
ELEPHANT
   : :::
   P–ANTHER

18 S-W: Summary Uses: score matrix, gap penalties. Complexity: O(mn). Sensitivity: high.

19 Growth of GenBank: ~33 million sequences as of Feb. 14, 2004. Source: http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html

20 BLAST: Basic Local Alignment Search Tool

21 BLAST: Steps 1. Divide both sequences into words of length w (default w = 3). 2. Calculate the score for each pair of words. 3. Extend high-scoring pairs to increase the score.

22 BLAST: Divide Sequences

23 BLAST: Calculate Score
ELE vs. PAN:  0, -1,  0 (score: -1)
LEP vs. PAN: -3, -1, -2 (score: -6)
EPH vs. PAN:  0, -1,  1 (score: 0)
PHA vs. PAN:  9, -2, -1 (score: 6)
HAN vs. PAN: -2,  5,  6 (score: 9)
ANT vs. PAN: -1, -1,  0 (score: -2)
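The word scores on this slide can be reproduced with a short Python sketch. The substitution values in `SCORES` are copied from the per-letter scores shown on the slide (only the pairs needed here, not a full matrix); the helper names `words`, `pair_score` and `word_score` are illustrative assumptions.

```python
# Per-letter substitution scores taken from the slide; keys are sorted letter pairs.
SCORES = {
    ("E", "P"): 0, ("A", "L"): -1, ("E", "N"): 0,
    ("L", "P"): -3, ("A", "E"): -1, ("N", "P"): -2,
    ("A", "P"): -1, ("H", "N"): 1, ("P", "P"): 9,
    ("A", "H"): -2, ("A", "N"): -1, ("H", "P"): -2,
    ("A", "A"): 5, ("N", "N"): 6, ("N", "T"): 0,
}


def pair_score(a, b):
    """Symmetric lookup: score(a, b) == score(b, a)."""
    return SCORES[tuple(sorted((a, b)))]


def words(seq, w=3):
    """All overlapping words of length w, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def word_score(u, v):
    """Sum of per-letter scores for two words of equal length."""
    return sum(pair_score(a, b) for a, b in zip(u, v))


# Score every word of ELEPHANT against PAN, the first word of PANTHER.
scored = [(u, word_score(u, "PAN")) for u in words("ELEPHANT")]
# Sort pairs on score, highest first, as in the next step of the method.
ranked = sorted(scored, key=lambda p: p[1], reverse=True)
```

The highest-ranked pair, HAN against PAN with score 9, is the kind of pair the extension step would start from.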

24 BLAST: Sort Pairs on Score

25 BLAST: Extension

26 BLAST: Summary Uses: score matrix, gap penalties, heuristics to reduce computations. Complexity: O(m) with O(n) processors. Sensitivity: low.

27 Sensitivity
Task: align 2 sequences, AXBXCXDXE and ABCDE.
Smith-Waterman:
AXBXCXDXE
: : : : :
A–B–C–D–E
BLAST: Ø (no similar substrings)
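Why the word-based approach finds nothing here can be checked directly: the two sequences share no word of length 3. The sketch below is a simplification that looks only for identical words (real BLAST also accepts sufficiently similar words); the helper names are assumptions.

```python
def word_set(seq, w=3):
    """Set of all overlapping words of length w in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def shared_words(s, t, w=3):
    """Words of length w that occur in both sequences."""
    return word_set(s, w) & word_set(t, w)


no_seed = shared_words("AXBXCXDXE", "ABCDE")
one_seed = shared_words("ELEPHANT", "PANTHER")
```

For AXBXCXDXE and ABCDE the result is empty, so there is nothing to extend, while ELEPHANT and PANTHER share the word ANT.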

28 S-W vs. BLAST [Figure: S-W and BLAST compared by speed and sensitivity]

29 S-W and BLAST Using them now: too costly, inefficient, time-consuming. Solution: more heuristics, more parallelism.

30 ParAlign

31 ParAlign: Steps 1. Find ungapped alignments. 2. Calculate approximate alignment scores. 3. Choose high-scoring sequences. 4. Apply S-W.
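The general shape of these steps, a cheap score first and the expensive S-W pass only on the survivors, can be sketched as follows. This is not ParAlign's actual heuristic: the ungapped diagonal score, the match/mismatch values and the names `best_ungapped_score` and `prefilter` are all assumptions chosen for illustration.

```python
def best_ungapped_score(s, t, match=5, mismatch=-4):
    """Best-scoring ungapped local segment over all diagonals of s vs. t."""
    best = 0
    for d in range(-(len(t) - 1), len(s)):   # diagonal offset d = i - j
        running = 0
        i, j = max(d, 0), max(-d, 0)
        while i < len(s) and j < len(t):
            # Reset to zero when the running score drops below zero.
            running = max(0, running + (match if s[i] == t[j] else mismatch))
            best = max(best, running)
            i += 1
            j += 1
    return best


def prefilter(query, database, keep=2):
    """Rank database sequences by the cheap score and keep the top few."""
    ranked = sorted(
        database,
        key=lambda seq: best_ungapped_score(query, seq),
        reverse=True,
    )
    return ranked[:keep]


# Only these candidates would be passed on to the full S-W computation.
candidates = prefilter("ELEPHANT", ["PANTHER", "ZZZZ", "ELEPHANT"])
```

The design point is that the cheap score only has to rank sequences well enough to avoid discarding true hits; the final scores still come from S-W.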

32 ParAlign: Microparallelism Divide wide registers into smaller units. Perform the same operation on different data sources. Modern microprocessors have this technology built in.
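The idea of splitting one wide register into smaller units can be imitated in plain Python by treating a 64-bit integer as eight 8-bit lanes. This is only an illustration of the principle, not ParAlign's code: real processors do this with dedicated SIMD instructions, and the names `pack`, `unpack` and `add_lanes` are assumptions.

```python
LANES = 8                       # eight 8-bit lanes in one 64-bit word
LOW7 = 0x7F7F7F7F7F7F7F7F       # low 7 bits of every lane
HIGH = 0x8080808080808080       # top bit of every lane


def pack(values):
    """Pack eight integers in 0..255 into one 64-bit integer (lane 0 lowest)."""
    word = 0
    for k, v in enumerate(values):
        word |= (v & 0xFF) << (8 * k)
    return word


def unpack(word):
    """Split a 64-bit integer back into its eight 8-bit lanes."""
    return [(word >> (8 * k)) & 0xFF for k in range(LANES)]


def add_lanes(a, b):
    """Add all eight lanes with one pass; each lane wraps modulo 256."""
    low = (a & LOW7) + (b & LOW7)   # 7-bit sums cannot carry into the next lane
    return low ^ ((a ^ b) & HIGH)   # fold the top bit of each lane back in


a = pack([1, 2, 3, 4, 5, 6, 7, 200])
b = pack([10, 20, 30, 40, 50, 60, 70, 100])
sums = unpack(add_lanes(a, b))
```

One call to `add_lanes` performs eight additions; the last lane shows the wrap-around (200 + 100 = 300, stored as 44). Hardware SIMD instruction sets also offer saturating variants that clamp instead of wrapping.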

33 ParAlign: Calculate Scores in Parallel

34 ParAlign: Estimate of Gaps

35 ParAlign: Apply S-W in Parallel

36 ParAlign: Summary Uses: SIMD technology (single instruction, multiple data), the S-W algorithm, heuristics to reduce computations. Requirement for machine: modern microprocessor. Speed: fast. Sensitivity: medium.

37 TurboBLAST

38 TurboBLAST: Steps 1. Divide the job: parts of the query against partitions of the database. 2. Apply BLAST. 3. Merge results.
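The divide, apply and merge steps can be sketched in Python. Everything here is a stand-in: `toy_search` is a hypothetical replacement for a real BLAST run (it merely counts shared 3-letter words), and the chunk sizes, worker count and function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def toy_search(query, db_part):
    """Hypothetical stand-in for one BLAST run: count shared 3-letter words."""
    qwords = {query[i:i + 3] for i in range(len(query) - 2)}
    hits = []
    for name, seq in db_part:
        score = sum(1 for i in range(len(seq) - 2) if seq[i:i + 3] in qwords)
        if score > 0:
            hits.append((query, name, score))
    return hits


def chunks(items, size):
    """Split a list into consecutive parts of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_search(queries, database, query_chunk=1, db_chunk=2, workers=4):
    # Divide: each task pairs one part of the query set with one database partition.
    tasks = list(product(chunks(queries, query_chunk), chunks(database, db_chunk)))

    def run(task):
        q_part, db_part = task
        return [hit for q in q_part for hit in toy_search(q, db_part)]

    # Apply the search to all tasks concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(run, tasks))

    # Merge: combine partial hit lists, ordered by query, then by descending score.
    merged = [hit for part in partial_results for hit in part]
    return sorted(merged, key=lambda h: (h[0], -h[2], h[1]))


database = [("d1", "PANTHER"), ("d2", "ELEPHANT"), ("d3", "ANTELOPE"), ("d4", "ZZZZ")]
hits = parallel_search(["ELEPHANT", "PANTHER"], database)
```

Because each task is independent, the merged result does not depend on how many workers ran or in which order the tasks finished.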

39 TurboBLAST: Implementation A three-tier system. Components: Client, Master, Workers.

40 TurboBLAST: Schema
Client: divides the job into tasks; writes results to file.
Master (TurboHub and File Provider): sets up tasks; manages execution; coordinates Workers; provides VSM (virtual shared memory).
Workers: divide a task; schedule subtasks; solve subtasks; merge results.
Messages between components: job, tasks, results, task request, task, DB request, DB part.
Work is distributed not by pushing it out, but by posting information about what work needs to be done and letting the machines grab work from the remote locations.
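The pull model described on this slide, where work is posted and workers grab it, can be sketched with a shared queue and threads. This is a single-process illustration under stated assumptions: the `queue.Queue` stands in for the shared task space, and `run_pull_based` and its parameters are hypothetical names.

```python
import queue
import threading


def run_pull_based(tasks, solve, n_workers=3):
    """Workers take tasks from a shared board instead of being assigned them."""
    board = queue.Queue()
    for task in tasks:
        board.put(task)                     # post what needs to be done
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = board.get_nowait()   # grab the next available task
            except queue.Empty:
                return                      # nothing left to do
            outcome = solve(task)
            with lock:
                results.append((task, outcome))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


done = run_pull_based(range(6), lambda x: x * x)
```

A fast worker simply comes back for more tasks sooner, so load balances itself without the poster having to know how many workers exist.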

41 TurboBLAST: Client Takes a BLAST job and divides it into a number of initial BLAST tasks. Submits these tasks to the Master. Retrieves the results and writes them to file.

42 TurboBLAST: Master Accepts tasks from Clients and sets them up for processing by the Workers. Includes TurboHub (the server portion of a parallel execution system). Includes File Provider (a Java application that manages the databases).

43 TurboBLAST: Worker Workers are processors. They run a Java application and perform the BLAST computations, merge the results, and are responsible for scheduling.

44 TurboHub TurboHub is an execution engine for parallel and distributed Java applications. Scalable high performance. Wide range of computing environments. Manages the flow of data through the workflows. Schedules the components. Transforms data between components. Balances load. Handles errors.

45 TurboBLAST: TurboHub Manages task execution. Coordinates the Workers. Provides a virtual shared memory. Supports dynamic changes in the set of Workers. Supports fault tolerance.

46 TurboBLAST: File Provider Maintains a copy of each database. Delivers all or part of each database to Workers as they require them.

47 TurboBLAST: Advantages Size of each task is optimal: processing is efficient on the processor that computes the task. Large set of tasks: no waste of time for processors. No algorithm change: support for all flavors of BLAST; easy to update. Applicable to different environments (PC, Macintosh …).

48 TurboBLAST: Experiment
Input data: 500 proteins, 200 to 400 amino acids in each
Database: 1,681,522,266 sequences
Hardware: IBM Linux cluster of 8 dual-processor workstations, each with 2 Pentium III processors (996 MHz each), 2 GB memory, 100 Mbit Ethernet

49 TurboBLAST: Results of Experiment

51 TurboBLAST: Summary Divide and conquer: use many copies of BLAST in parallel. Uses: BLAST algorithm. Requirement for each machine: Java VM, local BLAST executable. Speed: very fast. Sensitivity: low.

52 Comparison of Algorithms/Products [Figure: S-W, BLAST, ParAlign and TurboBLAST compared by speed and sensitivity]

53 References
R.D. Bjornson, A.H. Sherman, S.B. Weston, N. Willard, J. Wing. “TurboBLAST: A Parallel Implementation of BLAST Built on the TurboHub.” Intl. Parallel and Distributed Processing Symposium (IPDPS), 2002.
T. Rognes. “ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches.” Oxford University Press, 2001.

54 Don’t ask any Questions, please…

55 P.S. A web site where you can donate your computer time to help search for methods to cure cancer: http://www.the-optimists.org.uk

