Faculty of Computer Science Dalhousie University, Canada Andrew Rau-Chaplin, www.cs.dal.ca/~arc Parallel Computational Biochemistry.

Presentation transcript:

1 Faculty of Computer Science Dalhousie University, Canada Andrew Rau-Chaplin, www.cs.dal.ca/~arc Parallel Computational Biochemistry

2 Proteins, DNA, etc.
–DNA encodes the information necessary to produce proteins
–Proteins are the main molecular building blocks of life (for example, structural proteins and enzymes)

3 Proteins, DNA, etc.
–Proteins are formed from a chain of molecules called amino acids

4 Proteins, DNA, etc.
–The DNA sequence encodes the amino acid sequence that constitutes the protein

5 Proteins, DNA, etc.
–There are twenty amino acids found in proteins, denoted by A, C, D, E, F, G, H, I,...

6 Multiple Sequence Alignment

7 Databases of Biological Sequences
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
NCBI: 14,976,310 sequences, 15,849,921,438 nucleotides
Swiss-Prot: 104,559 sequences, 38,460,707 residues
PDB: 17,175 structures

8 Sequence comparison
–Compare one sequence (target) to many sequences (database search)
–Compare more than two sequences simultaneously

9 Applications
–Phylogenetic analysis
–Identification of conserved motifs and domains
–Structure prediction

11 Phylogenetic Analysis

12 Structure Prediction
Genomic sequences
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Protein sequences
Protein structures

13 Clustal W

14 Progressive Alignment
1. Do pairwise alignment of all sequences and calculate the distance matrix:
                  [1]    [2]    [3]    [4]
Scerevisiae [1]
Celegans    [2]  0.640
Drosophila  [3]  0.634  0.327
Human       [4]  0.630  0.408  0.420
Mouse       [5]  0.619  0.405  0.469  0.289
2. Create a guide tree based on this pairwise distance matrix (leaves: S.cerevisiae, C.elegans, Drosophila, Mouse, Human)
3. Align progressively following the guide tree:
–start by aligning the most closely related pairs of sequences
–at each step, align two sequences or one sequence to an existing subalignment
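The three steps can be illustrated with a small Python sketch that turns the distance matrix from this slide into a merge order for the guide tree. This is only an illustration: it uses plain average-linkage (UPGMA-style) clustering, whereas Clustal W itself builds its guide tree with neighbour-joining, and the function name `guide_tree_order` is hypothetical.

```python
from itertools import combinations

# Lower-triangular distance matrix as shown on the slide.
names = ["Scerevisiae", "Celegans", "Drosophila", "Human", "Mouse"]
lower = [
    [],
    [0.640],
    [0.634, 0.327],
    [0.630, 0.408, 0.420],
    [0.619, 0.405, 0.469, 0.289],
]

def guide_tree_order(names, lower):
    """Return the list of cluster merges, closest pair first (average linkage)."""
    dist = {
        frozenset((names[i], names[j])): lower[i][j]
        for i in range(len(names))
        for j in range(i)
    }

    def avg(a, b):
        # Mean pairwise distance between two clusters (tuples of names).
        return sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

merges = guide_tree_order(names, lower)
for a, b in merges:
    print("align", a, "with", b)
```

The merge order is the order in which a progressive aligner would combine sequences and subalignments: the most similar pair first, the most distant sequence last.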

15 Parallel Clustal
–Parallel pairwise (PW) alignment matrix
–Parallel guide tree calculation
–Parallel progressive alignment
(figure: distance matrix and guide tree for S.cerevisiae, C.elegans, Drosophila, Human, Mouse, as on the previous slide)

16 Parallel Clustal - Improvements
Optimization of input parameters
–scoring matrices, gap penalties
–requires many repetitive Clustal W calculations with various input parameters
Minimum Vertex Cover
–use minimum vertex cover to remove erroneous sequences and to identify clusters of highly similar sequences

17 Minimum Vertex Cover
Conflict Graph
–vertex: sequence
–edge: conflict (e.g. alignment with very poor score)
TASK: remove the smallest number of gene sequences that eliminates all conflicts
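As a rough illustration of the conflict graph, the Python sketch below builds edges from pairwise alignment scores and finds a minimum vertex cover by brute force. The scores, the threshold value, and the rule "score below threshold means conflict" are assumptions made for the example; the slide only says that a conflict is, for instance, an alignment with a very poor score. Brute force is used for clarity only and is not the FPT method of the following slides.

```python
from itertools import combinations

def conflict_graph(scores, threshold):
    """Edges between sequences whose pairwise score falls below the threshold (assumed rule)."""
    return [(a, b) for (a, b), score in scores.items() if score < threshold]

def is_vertex_cover(edges, removed):
    """True if every conflict edge touches at least one removed sequence."""
    return all(a in removed or b in removed for a, b in edges)

def min_vertex_cover(edges):
    """Smallest set of sequences whose removal eliminates all conflicts (exponential time)."""
    vertices = sorted({v for edge in edges for v in edge})
    for size in range(len(vertices) + 1):
        for subset in combinations(vertices, size):
            if is_vertex_cover(edges, set(subset)):
                return set(subset)

# Hypothetical scores: X aligns poorly with everything else.
scores = {
    ("A", "B"): 40, ("A", "C"): 35, ("B", "C"): 42,
    ("A", "X"): 3, ("B", "X"): 5, ("C", "X"): 2,
}
edges = conflict_graph(scores, threshold=10)
cover = min_vertex_cover(edges)
print(edges, cover)
```

In this toy input the single outlier sequence is the whole cover, which matches the intended use: removing the cover leaves a set of sequences with no remaining conflicts.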

18 FPT Algorithms
Phase 1: Kernelization
–reduce problem to size f(k)
Phase 2: Bounded Tree Search
–exhaustive tree search; exponential in f(k)

19 Kernelization: Buss's Algorithm for k-vertex cover
Let G=(V,E) and let S be the subset of vertices with degree greater than k; every vertex in S must belong to any vertex cover of size at most k.
Remove S and all incident edges: G -> G', k -> k' = k - |S|.
IF G' has more than k · k' edges THEN no k-vertex cover exists
ELSE start bounded tree search on G'
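A minimal Python sketch of this kernelization step follows, assuming a simple graph given as a list of edges without duplicates or self-loops. The function name `buss_kernel` and its return convention are hypothetical. It takes S to be the vertices of degree greater than k, since only those are forced into every cover of size at most k, and it adds the check that S itself fits within the budget.

```python
def buss_kernel(edges, k):
    """One pass of Buss's kernelization.

    Returns (S, kernel_edges, k_prime), or None when no k-vertex cover can exist.
    """
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1

    # Vertices of degree > k must be in every cover of size <= k.
    forced = {v for v, d in degree.items() if d > k}
    if len(forced) > k:
        return None
    k_prime = k - len(forced)

    # G': drop the forced vertices together with their incident edges.
    kernel = [(u, v) for u, v in edges if u not in forced and v not in forced]

    # Every remaining vertex has degree <= k, so k' vertices cover at most k * k' edges.
    if len(kernel) > k * k_prime:
        return None
    return forced, kernel, k_prime

# A star centred at 0 plus one separate edge.
result = buss_kernel([(0, 1), (0, 2), (0, 3), (0, 4), (5, 6)], k=2)
print(result)
```

A `None` result means the instance can be rejected immediately; otherwise the bounded tree search only has to work on the reduced edge list with budget k'.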

20 Bounded Tree Search

21 Case 1: simple path of length 3
–remove selected vertices from G'
–k' -= 2

22 Case 2: 3-cycle
–remove selected vertices from G'
–k' -= 2

23 Case 3: simple path of length 2
–remove v1, v2 from G'
–k' -= 1

24 Case 4: simple path of length 1
–remove v, v1 from G'
–k' -= 1

25 Sequential Tree Search
Depth first search
–backtrack when k' = 0 and G' ≠ {} ("dead end")
–stop when solution found (G' = {}, k' >= 0)
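The depth-first search with its two stopping rules can be sketched in Python as below. Note the simplification: instead of the four-case branching of the preceding slides, this sketch uses the most basic rule (pick any remaining edge and try each endpoint in turn), so each step decreases k' by 1. The function name `vertex_cover_search` is hypothetical.

```python
def vertex_cover_search(edges, k):
    """Return a vertex cover of size <= k as a set, or None if none exists."""
    if not edges:
        return set()      # solution found: G' = {}, k' >= 0
    if k == 0:
        return None       # dead end: edges remain but the budget is used up
    u, v = edges[0]
    for chosen in (u, v):  # one endpoint of every edge must be in the cover
        remaining = [e for e in edges if chosen not in e]
        sub_cover = vertex_cover_search(remaining, k - 1)
        if sub_cover is not None:
            return sub_cover | {chosen}
    return None            # both branches failed: backtrack

path = [(1, 2), (2, 3), (3, 4)]
cover = vertex_cover_search(path, 2)
print(cover)
```

The search tree has depth at most k and branching factor 2, so the running time is exponential in k only, which is the point of the bounded tree search phase.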

26 Parallel Tree Search
Basic Idea:
–build the top log p levels of the search tree (T')
–every processor starts a depth-first search at one leaf of T'
–randomize the depth-first search by selecting a random child
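The basic idea can be sketched in Python as follows. Several things here are assumptions for illustration: the processors are simulated by a plain loop rather than MPI processes, the branching rule is the simple two-way edge branching (not the four cases shown earlier), p is taken to be a power of two, and all function names are hypothetical.

```python
import math
import random

def expand(edges, k, partial):
    """Children of a search node: put one endpoint of the first edge into the cover."""
    u, v = edges[0]
    return [([e for e in edges if c not in e], k - 1, partial | {c}) for c in (u, v)]

def frontier(edges, k, p):
    """Build the top log2(p) levels of the search tree and return its leaves (T')."""
    nodes = [(list(edges), k, frozenset())]
    for _ in range(int(math.log2(p))):
        next_nodes = []
        for node_edges, node_k, partial in nodes:
            if not node_edges:
                next_nodes.append((node_edges, node_k, partial))  # already a solution
            elif node_k > 0:
                next_nodes.extend(expand(node_edges, node_k, partial))
            # nodes with edges left but no budget are dead ends and are dropped
        nodes = next_nodes
    return nodes

def random_dfs(edges, k, partial, rng):
    """Depth-first search below one leaf, visiting children in random order."""
    if not edges:
        return set(partial)
    if k == 0:
        return None
    children = expand(edges, k, partial)
    rng.shuffle(children)
    for child in children:
        found = random_dfs(*child, rng)
        if found is not None:
            return found
    return None

def parallel_search(edges, k, p, seed=0):
    rng = random.Random(seed)
    for leaf in frontier(edges, k, p):  # in a real run, one leaf per processor
        found = random_dfs(*leaf, rng)
        if found is not None:
            return found
    return None

cycle = [(1, 2), (2, 3), (3, 4), (4, 1)]
cover = parallel_search(cycle, k=2, p=4)
print(cover)
```

In an actual parallel run each leaf would be searched concurrently and the first processor to find a cover would stop the others; the loop above only shows how the work is partitioned.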

27 Analysis: Balls-in-bins
sequential depth-first search path
–total length: L, #solutions: m
–expected sequential time (rand. distr.): L/(m+1)
parallel search path
–expected parallel time (rand. distr.): p + L/(p(m+1))
–expected speedup: p / (1 + (m+1)/L)
–if m << L then expected speedup = p
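The sequential estimate can be checked on a small instance with the Python sketch below, which averages over every possible placement of m solutions on a path of length L. The model is an assumption made for illustration: positions are numbered 1..L, a sequential search stops at the first solution it reaches, and the parallel search splits the path into p contiguous segments of equal length, one per processor, ignoring the start-up term p that appears in the slide's parallel time. Under this model the exact sequential expectation is (L+1)/(m+1), which the slide approximates by L/(m+1).

```python
from fractions import Fraction
from itertools import combinations

def expected_times(L, m, p):
    """Exact expected sequential and parallel search times over all solution placements.

    Assumes p divides L; each processor scans one contiguous segment of length L // p.
    """
    segment = L // p
    seq_total = par_total = count = 0
    for solutions in combinations(range(1, L + 1), m):
        # Sequential: steps until the first solution on the whole path.
        seq_total += min(solutions)
        # Parallel: steps until any processor reaches a solution in its own segment.
        par_total += min((s - 1) % segment + 1 for s in solutions)
        count += 1
    return Fraction(seq_total, count), Fraction(par_total, count)

seq, par = expected_times(L=12, m=2, p=3)
print("sequential:", seq, "parallel:", par, "speedup:", float(seq / par))
```

Enumeration is only feasible for small L and m; for sizes such as L = 1,000,000 one would sample random placements instead.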

28 Simulation Experiment
L = 1,000,000

29 Implementation
test platform:
–32 node Beowulf cluster
–each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
–gcc and LAM/MPI on Linux Red Hat 7.2
code-s: Sequential k-vertex cover
code-p: Parallel k-vertex cover

30 HPCVL
High Performance Computing Virtual Laboratory - HPCVL (www.hpcvl.org)
Created by parallel computing researchers from
–Carleton U. (Comp. Sci.)
–Queen's (Engineering)
–Ottawa U. (Life Sci./Hospital)
Obtained $30M+ in Federal (CFI) and Ontario (OIT, ORDCF) grants

31 Test Data
Protein sequences
–same protein from several hundred species
–each protein sequence is a few hundred amino acid residues in length
–obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

32 Test Data: Somatostatin
–neuropeptide involved in the regulation of many functions in different organ systems
–Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

33 Test Data: WW
–small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling
–Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

34 Test Data: Kinase
–large family of enzymes involved in cellular regulation
–Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397

35 Test Data: SH2 (src-homology domain 2)
–involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
–Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

36 Test Data: Thrombin
–protease involved in the blood coagulation cascade; promotes blood clotting by converting fibrinogen to fibrin
–Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

37 Test Data: PHD (pleckstrin homology domain)
–involved in cellular signaling
–Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603

38 Test Data
–Random Graph: |V| = 220, |E| = 2155, k = 122, k' = 122
–Grid Graph: |V| = 289, |E| = 544, k = 145, k' = 145

39 Test Data
–|VC| ~ |V| / 2
–k' = k

40 Sequential Times
Kinase, SH2, Thrombin: n/a

41 Code-p on Virtual Proc.

42 Parallel Times

43 Speedup: Somatostatin

44 Speedup: WW

45 Speedup: Rand. Graph

46 Speedup: Grid Graph

47 Thank You! Questions?
