Frank Dehne (www.dehne.net): Parallel Computational Biochemistry

Presentation transcript:

1 Frank Dehne, www.dehne.net. Parallel Computational Biochemistry

2 Proteins, DNA, etc. DNA encodes the information necessary to produce proteins. Proteins are the main molecular building blocks of life (for example, structural proteins and enzymes).

3 Proteins, DNA, etc. Proteins are formed from a chain of molecules called amino acids.

4 Proteins, DNA, etc. The DNA sequence encodes the amino acid sequence that constitutes the protein.

5 Proteins, DNA, etc. Twenty amino acids are found in proteins, denoted by the one-letter codes A, C, D, E, F, G, H, I, ...

6 Multiple Sequence Alignment

7 Databases of Biological Sequences
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
NCBI: 14,976,310 sequences, 15,849,921,438 nucleotides
Swiss-Prot: 104,559 sequences, 38,460,707 residues
PDB: 17,175 structures

8 Sequence comparison
- Compare one sequence (target) to many sequences (database search)
- Compare more than two sequences simultaneously

9 Applications
- Phylogenetic analysis
- Identification of conserved motifs and domains
- Structure prediction

11 Phylogenetic Analysis

12 Structure Prediction
Genomic sequences -> Protein sequences -> Protein structures
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH

13 Our Contributions
- Parallel min vertex cover for improved sequence alignments (to appear in Journal of Computer and System Sciences)
- Parallel Clustal W (ICCSA 2003)
- In progress: “Clustal XP” portal at http://cgm.dehne.net

14 Clustal W

15 Progressive Alignment
1. Do pairwise alignment of all sequences and calculate the distance matrix:
    Scerevisiae [1]
    Celegans    [2]  0.640
    Drosophila  [3]  0.634  0.327
    Human       [4]  0.630  0.408  0.420
    Mouse       [5]  0.619  0.405  0.469  0.289
2. Create a guide tree based on this pairwise distance matrix (leaves: S.cerevisiae, C.elegans, Drosophila, Mouse, Human).
3. Align progressively following the guide tree: start by aligning the most closely related pairs of sequences; at each step, align two sequences or one sequence to an existing subalignment.
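Step 2 can be illustrated with a small sketch that turns the distance matrix from this slide into a guide tree. This is only an illustration: it uses plain average-linkage (UPGMA-style) clustering, which is a simpler stand-in and not necessarily the tree-building method used by Clustal W or by the parallel code. The function name `build_guide_tree` and the nested-tuple tree representation are assumptions made for this example.

```python
from itertools import combinations

# Distances copied from the slide's lower-triangular matrix.
names = ["S.cerevisiae", "C.elegans", "Drosophila", "Human", "Mouse"]
lower = [
    [],
    [0.640],
    [0.634, 0.327],
    [0.630, 0.408, 0.420],
    [0.619, 0.405, 0.469, 0.289],
]

def build_guide_tree(names, lower):
    """Average-linkage clustering (illustrative stand-in); returns nested tuples."""
    # dist maps frozenset({a, b}) -> distance between clusters a and b.
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((names[i], names[j]))] = d
    size = {n: 1 for n in names}
    clusters = list(names)
    while len(clusters) > 1:
        # Merge the two closest clusters.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        rest = [c for c in clusters if c != a and c != b]
        for c in rest:
            # Size-weighted average of the distances to the merged members.
            da = dist[frozenset((a, c))]
            db = dist[frozenset((b, c))]
            dist[frozenset((merged, c))] = (
                (size[a] * da + size[b] * db) / (size[a] + size[b])
            )
        size[merged] = size[a] + size[b]
        clusters = rest + [merged]
    return clusters[0]

tree = build_guide_tree(names, lower)
print(tree)
```

Human and Mouse are merged first because 0.289 is the smallest entry in the matrix; the progressive alignment in step 3 would then follow the tree from these closest pairs upward.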

16 Parallel Clustal
- Parallel pairwise (PW) alignment matrix
- Parallel guide tree calculation
- Parallel progressive alignment
(Figure: the distance matrix and guide tree for S.cerevisiae, C.elegans, Drosophila, Mouse, Human.)
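The pairwise stage is the easiest of the three to parallelize, since each of the n(n-1)/2 alignments is independent. The sketch below shows one possible way to split the pairs among p processors using a static round-robin assignment. The slides do not say which distribution scheme Parallel Clustal uses, so the scheme and the name `assign_pairs` are assumptions for illustration only.

```python
from itertools import combinations

def assign_pairs(n_sequences, n_procs):
    """Distribute all unordered sequence pairs round-robin over n_procs buckets."""
    buckets = [[] for _ in range(n_procs)]
    for idx, pair in enumerate(combinations(range(n_sequences), 2)):
        buckets[idx % n_procs].append(pair)
    return buckets

# Five sequences (as on the slide) split over three hypothetical processors.
for rank, pairs in enumerate(assign_pairs(5, 3)):
    print(rank, pairs)
```

Each processor would align its own pairs and the resulting distances would then be gathered into the full matrix before the guide tree is built.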

17 Relative Speedup

18 Clustal XP vs. SGI
SGI data taken from "Performance Optimization of Clustal W: Parallel Clustal W, HT Clustal, and MULTICLUSTAL" by Dmitri Mikhailov, Haruna Cofer, and Roberto Gomperts.

19 Parallel Clustal - Improvements
- Optimization of input parameters (scoring matrices, gap penalties): requires many repeated Clustal W runs with different input parameters.
- Minimum Vertex Cover: use a minimum vertex cover to remove erroneous sequences and to identify clusters of highly similar sequences.

20 Minimum Vertex Cover
Conflict graph:
- vertex: sequence
- edge: conflict (e.g. alignment with very poor score)
TASK: remove the smallest number of gene sequences that eliminates all conflicts.
NP-complete.
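As a rough sketch of how such a conflict graph could be built, the code below adds an edge whenever a pairwise alignment score falls below a threshold. The score values, the direction of the comparison, and the name `build_conflict_graph` are assumptions for illustration; the slides only say that an edge marks a conflict such as a very poor alignment score.

```python
def build_conflict_graph(scores, threshold):
    """scores: dict mapping (u, v) -> pairwise alignment score.

    Returns an adjacency dict: vertex -> set of conflicting vertices.
    """
    graph = {}
    for (u, v), score in scores.items():
        graph.setdefault(u, set())
        graph.setdefault(v, set())
        if score < threshold:  # assumed meaning of "very poor score"
            graph[u].add(v)
            graph[v].add(u)
    return graph

# Hypothetical scores for four sequences; "c" aligns poorly with all others.
scores = {
    ("a", "b"): 55, ("a", "c"): 4, ("a", "d"): 60,
    ("b", "c"): 7, ("b", "d"): 48, ("c", "d"): 9,
}
graph = build_conflict_graph(scores, threshold=10)
print(graph)
```

In this toy input, removing the single sequence "c" eliminates every conflict, so {"c"} is a minimum vertex cover of the conflict graph.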

21 FPT Algorithms
Phase 1: Kernelization. Reduce the problem to size f(k).
Phase 2: Bounded Tree Search. Exhaustive tree search; exponential in f(k).

22 Kernelization
Buss's algorithm for k-vertex cover:
Let G = (V, E) and let S be the subset of vertices with degree greater than k (each such vertex must be in every vertex cover of size at most k).
Remove S and all incident edges: G -> G', k -> k' = k - |S|.
IF G' has more than k x k' edges THEN no k-vertex cover exists
ELSE start bounded tree search on G'
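The kernelization step can be sketched in a few lines. This is a minimal illustration, not the authors' code: it represents the graph as a dict of neighbour sets, uses the strict test "degree > k" (the standard form of Buss's rule, since a vertex with more than k neighbours cannot be left out of a cover of size k), and the name `buss_kernel` and its return convention are assumptions.

```python
def buss_kernel(graph, k):
    """graph: dict vertex -> set of neighbours.

    Returns (kernel, k_prime, forced) or None if no k-vertex cover can exist.
    """
    # Vertices of degree > k must be in every vertex cover of size <= k.
    forced = {v for v, nbrs in graph.items() if len(nbrs) > k}
    k_prime = k - len(forced)
    if k_prime < 0:
        return None
    # Remove the forced vertices and their incident edges.
    kernel = {v: nbrs - forced for v, nbrs in graph.items() if v not in forced}
    # Isolated vertices play no role in a vertex cover.
    kernel = {v: nbrs for v, nbrs in kernel.items() if nbrs}
    n_edges = sum(len(nbrs) for nbrs in kernel.values()) // 2
    # Every remaining vertex has degree <= k, so k' vertices cover <= k * k' edges.
    if n_edges > k * k_prime:
        return None
    return kernel, k_prime, forced

# A star with centre 0 and four leaves, plus one extra edge 5-6.
star = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {6}, 6: {5}}
print(buss_kernel(star, 2))
```

With k = 2 the centre of the star is forced into the cover, leaving the single edge 5-6 and k' = 1 for the tree search.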

23 Bounded Tree Search

24 Case 1: simple path of length 3. Remove selected vertices from G'; k' -= 2

25 Case 2: 3-cycle. Remove selected vertices from G'; k' -= 2

26 Case 3: simple path of length 2. Remove v1, v2 from G'; k' -= 1

27 Case 4: simple path of length 1. Remove v, v1 from G'; k' -= 1

28 Sequential Tree Search
Depth-first search:
- backtrack when k' = 0 and G' is not empty ("dead end")
- stop when a solution is found (G' = {}, k' >= 0)
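The backtracking and stopping rules can be seen in a minimal depth-first search. Note the simplification: instead of the four-case branching on paths and cycles from the preceding slides, this sketch uses the basic two-way branch "for any edge (u, v), either u or v is in the cover". The function name `vertex_cover_search` and the dict-of-sets graph representation are assumptions for this example.

```python
def vertex_cover_search(graph, k):
    """Return a vertex cover of size <= k as a set, or None if none exists.

    graph: dict vertex -> set of neighbours.
    """
    # Pick any remaining edge.
    edge = next(((u, v) for u, nbrs in graph.items() for v in nbrs), None)
    if edge is None:
        return set()   # G' is empty: solution found
    if k == 0:
        return None    # edges remain but the budget is used up: dead end
    for chosen in edge:
        # Put `chosen` into the cover and delete it from the graph.
        reduced = {u: nbrs - {chosen}
                   for u, nbrs in graph.items() if u != chosen}
        sub = vertex_cover_search(reduced, k - 1)
        if sub is not None:
            return sub | {chosen}
    return None        # both branches failed: backtrack

path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(vertex_cover_search(path, 1), vertex_cover_search(path, 2))
```

The search tree has depth at most k and two children per node, so its size depends only on k and not on the size of the graph.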

29 Parallel Tree Search
Basic idea:
- Build the top log p levels of the search tree (T')
- Every processor starts a depth-first search at one leaf of T'
- Randomize the depth-first search by selecting a random child
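The three steps above can be sketched as follows. This is a sequential stand-in, not an MPI program: the leaves of T' are processed one after another in a loop where the real code would hand one leaf to each processor. It also assumes a binary search tree built from simple edge branching (either endpoint of an edge is in the cover) rather than the case analysis from the earlier slides. All function names are hypothetical.

```python
import random

def pick_edge(graph):
    return next(((u, v) for u, nbrs in graph.items() for v in nbrs), None)

def remove_vertex(graph, x):
    return {u: nbrs - {x} for u, nbrs in graph.items() if u != x}

def expand_frontier(graph, k, depth):
    """Build the top `depth` levels of the search tree; return its leaves."""
    level = [(graph, k, frozenset())]
    for _ in range(depth):
        nxt = []
        for g, kk, partial in level:
            edge = pick_edge(g)
            if edge is None or kk == 0:
                nxt.append((g, kk, partial))  # cannot branch further
                continue
            for x in edge:
                nxt.append((remove_vertex(g, x), kk - 1, partial | {x}))
        level = nxt
    return level

def random_dfs(graph, k, rng):
    """Depth-first search that visits the children in random order."""
    edge = pick_edge(graph)
    if edge is None:
        return set()
    if k == 0:
        return None
    children = list(edge)
    rng.shuffle(children)
    for x in children:
        sub = random_dfs(remove_vertex(graph, x), k - 1, rng)
        if sub is not None:
            return sub | {x}
    return None

def parallel_search_sketch(graph, k, p, seed=0):
    depth = max(p.bit_length() - 1, 0)  # floor(log2 p) levels
    rng = random.Random(seed)
    # Stand-in for "every processor starts at one leaf of T'".
    for g, kk, partial in expand_frontier(graph, k, depth):
        sub = random_dfs(g, kk, rng)
        if sub is not None:
            return set(partial) | sub
    return None

cycle = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
print(parallel_search_sketch(cycle, 3, p=4))
```

In an actual parallel run, the first processor to find a solution would notify the others so that they can stop early.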

30 Analysis: Balls-in-bins
Sequential depth-first search path: total length L, number of solutions m
Expected sequential time (random distribution): L/(m+1)
Expected parallel time (random distribution): p + L/(p(m+1))
Expected speedup: p / (1 + (m+1)/L)
If m << L, then the expected speedup is approximately p

31 Simulation Experiment
L = 1,000,000
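The plot from this slide is not part of the transcript. As a rough illustration of how such an experiment might be set up, the sketch below places m solutions uniformly at random on a search path of length L, measures how far a single sequential search walks before hitting the first one, and compares that with p searches that each scan one contiguous segment of length L/p. The start-up cost of building the top of the tree is ignored, and the values of m, p, the number of trials, and the function name are assumptions, not the parameters of the original experiment.

```python
import random

def simulate_speedup(L, m, p, trials, seed=0):
    """Estimate the speedup as the ratio of mean sequential to mean parallel time."""
    rng = random.Random(seed)
    seg = L // p                       # length of each processor's segment
    seq_total = 0
    par_total = 0
    for _ in range(trials):
        positions = [rng.randrange(seg * p) for _ in range(m)]
        # Sequential: steps until the first solution on the whole path.
        seq_total += min(positions) + 1
        # Parallel: all segments are scanned at once, so the time is the
        # smallest offset of any solution within its own segment.
        par_total += min(pos % seg for pos in positions) + 1
    return seq_total / par_total

speedup = simulate_speedup(L=1_000_000, m=100, p=8, trials=2000)
print(round(speedup, 2))
```

With m much smaller than L, the estimated ratio should come out close to p, in line with the balls-in-bins analysis.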

32 Implementation
Test platform:
- 32-node HPCVL Beowulf cluster
- each node: dual 1.4 GHz Intel Xeon, 512 MB RAM, 60 GB disk
- gcc and LAM/MPI on Red Hat Linux 7.2
code-s: sequential k-vertex cover
code-p: parallel k-vertex cover

33 Test Data
- Protein sequences
- Same protein from several hundred species
- Each protein sequence is a few hundred amino acid residues in length
- Obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/)

34 Test Data: Somatostatin
- neuropeptide involved in the regulation of many functions in different organ systems
- Clustal Threshold = 10, |V| = 559, |E| = 33652, k = 273, k' = 255

35 Test Data: WW
- small protein domain that binds proline-rich sequences in other proteins and is involved in cellular signaling
- Clustal Threshold = 10, |V| = 425, |E| = 40182, k = 322, k' = 318

36 Test Data: Kinase
- large family of enzymes involved in cellular regulation
- Clustal Threshold = 16, |V| = 647, |E| = 113122, k = 497, k' = 397

37 Test Data: SH2 (src-homology domain 2)
- involved in targeting proteins to specific sites in cells by binding to phosphotyrosine
- Clustal Threshold = 10, |V| = 730, |E| = 95463, k = 461, k' = 397

38 Test Data: Thrombin
- protease in the blood coagulation cascade that promotes blood clotting by converting fibrinogen to fibrin
- Clustal Threshold = 15, |V| = 646, |E| = 62731, k = 413, k' = 413

39 Test Data: PHD (pleckstrin homology domain)
- involved in cellular signaling
- Clustal Threshold = 10, |V| = 670, |E| = 147054, k = 603, k' = 603

40 Test Data
- Random Graph: |V| = 220, |E| = 2155, k = 122, k' = 122
- Grid Graph: |V| = 289, |E| = 544, k = 145, k' = 145

41 Test Data
|VC| ~ |V| / 2, k' = k

42 Sequential Times
Kinase, SH2, Thrombin: n/a

43 Code-p on Virtual Proc.

44 Parallel Times

45 Speedup: Somatostatin

46 Speedup: WW

47 Speedup: Rand. Graph

48 Speedup: Grid Graph

49 Clustal XP (in progress)
Clustal W + Parallel Clustal + Parallel FPT MVC + … + Clustal XP Web Portal
X: Extended, P: Parallel

50 Clustal XP: http://cgm.dehne.net
