Eidhammer et al. Protein Bioinformatics Chapter 4 1 Multiple Global Sequence Alignment and Phylogenetic trees Inge Jonassen and Ingvar Eidhammer.

1 Eidhammer et al. Protein Bioinformatics Chapter 4 1 Multiple Global Sequence Alignment and Phylogenetic trees Inge Jonassen and Ingvar Eidhammer

2 Eidhammer et al. Protein Bioinformatics Chapter 4 2 Definition
A global alignment of a set of sequences is obtained by
– inserting gap characters '-' into each sequence so that
– the resulting sequences are all of the same length, and so that
– no “column” consists only of gap characters
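The definition can be checked mechanically. The following is a minimal sketch, assuming alignments are given as Python strings with '-' as the gap character; the function name `is_global_alignment` and the example rows are illustrative and not taken from the slides.

```python
GAP = "-"

def is_global_alignment(rows, originals):
    """Check the conditions of the definition for a list of gapped rows."""
    if not rows or len(rows) != len(originals):
        return False
    # Removing the gaps must give back each original sequence.
    if any(r.replace(GAP, "") != s for r, s in zip(rows, originals)):
        return False
    # All rows must have the same length.
    if len({len(r) for r in rows}) != 1:
        return False
    # No column may consist of gap characters only.
    return all(any(c != GAP for c in col) for col in zip(*rows))

originals = ["ARL", "ARTL", "AWT"]
print(is_global_alignment(["AR-L", "ARTL", "AWT-"], originals))     # True
print(is_global_alignment(["AR--L", "ART-L", "AWT--"], originals))  # False: one column is all gaps
```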

3 Eidhammer et al. Protein Bioinformatics Chapter 4 3 Example: Chromo domains aligned

4 Eidhammer et al. Protein Bioinformatics Chapter 4 4 Use of alignments
– High sequence similarity usually means significant structural and/or functional similarity. The reverse need not be true.
– Homologous proteins (sharing a common ancestor) can vary significantly in large parts of their sequences and still retain common 2D patterns, 3D patterns, or a common active site or binding site.
– Comparing several sequences in a family can reveal what is common to the family (from Lesk: two homologous sequences whisper ... a full multiple alignment shouts out loud).
– A feature shared by several sequences can be significant when all of the sequences are considered, even if it is not significant when only two are considered.
– Multiple alignments can be used to derive evolutionary history.

5 Eidhammer et al. Protein Bioinformatics Chapter 4 5 Use of alignments
Predict features of aligned objects
– conserved positions -> structurally/functionally important

6 Eidhammer et al. Protein Bioinformatics Chapter 4 6 Conserved positions

7 Eidhammer et al. Protein Bioinformatics Chapter 4 7 Use of alignments
Predict features of aligned objects
– conserved positions -> structurally/functionally important
– patterns of hydrophobicity/hydrophilicity -> secondary structure elements

8 Eidhammer et al. Protein Bioinformatics Chapter 4 8 Helix pattern

9 Eidhammer et al. Protein Bioinformatics Chapter 4 9 Use of alignments
Predict features of aligned objects
– conserved positions -> structurally/functionally important
– patterns of hydrophobicity/hydrophilicity -> secondary structure elements
– “gappy” regions -> loops/variable regions

10 Eidhammer et al. Protein Bioinformatics Chapter 4 10 Loop?

11 Eidhammer et al. Protein Bioinformatics Chapter 4 11 Use of alignments
Predict features of aligned objects
– conserved positions -> structurally/functionally important
– patterns of hydrophobicity/hydrophilicity -> secondary structure elements
– “gappy” regions -> loops/variable regions
– covariation -> structural proximity

13 Eidhammer et al. Protein Bioinformatics Chapter 4 13 Use of alignments - make patterns/profiles
– An alignment can be turned into a profile or a pattern, which can be matched against a sequence database to identify new family members
– Profiles/patterns can be used to predict family membership of new sequences
– Databases of profiles/patterns: PROSITE, PFAM, PRINTS, ...

14 Eidhammer et al. Protein Bioinformatics Chapter 4 14 Prosite: Motifs for classification
[Figure: a protein sequence is matched against Prosite patterns 1 to n, which correspond to families 1 to n. Labels: Pattern, Regular expression, Profile.]

15 Eidhammer et al. Protein Bioinformatics Chapter 4 15 Pattern from alignment [FYL]-x-[LIVMC]-[KR]-W-x-[GDNR]-[FYWLE]-x(5,6)-[ST]-W-[ES]-[PSTDN]-x(3)-[LIVMC]
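A pattern in this notation can be translated into a regular expression and matched against sequences. The sketch below covers only the syntax elements that occur in the pattern above plus `{..}` exclusions; the function name `prosite_to_regex` and the test sequence are made up for illustration and do not come from the slides or from any PROSITE tool.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        # Split off an optional repeat count such as (3) or (5,6).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, count = m.groups()
        if core == "x":
            core = "."                       # x: any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # {..}: any residue except these
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

pattern = ("[FYL]-x-[LIVMC]-[KR]-W-x-[GDNR]-[FYWLE]-x(5,6)-"
           "[ST]-W-[ES]-[PSTDN]-x(3)-[LIVMC]")
regex = prosite_to_regex(pattern)
print(regex)

# Hypothetical sequence constructed to satisfy the pattern.
print(bool(re.search(regex, "MKFALKWAGFAAAAASWEPAAALQ")))
```

Scanning a sequence database then amounts to running `re.search` with the translated expression over every sequence.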

16 Eidhammer et al. Protein Bioinformatics Chapter 4 16 Alignment problem Given a set of sequences, produce a multiple alignment which corresponds as well as possible to the biological relationships between the corresponding bio-molecules

17 Eidhammer et al. Protein Bioinformatics Chapter 4 17 For homologous proteins Two residues should be aligned (on top of each other) –if they are homologous (evolved from the same residue in a common ancestor protein) –if they are structurally equivalent

18 Eidhammer et al. Protein Bioinformatics Chapter 4 18 Automatic approach Need a way of scoring alignments –fitness function which for an alignment quantifies its “goodness” Need an algorithm for finding alignments with good scores Not all methods provide a scoring function for the final alignment!

19 Eidhammer et al. Protein Bioinformatics Chapter 4 19 Analysis of fitness function
One can test whether the alignments that are optimal under a given fitness function correspond well to the biological relationships between the sequences. This is possible, for example, when the structures of (some of) the proteins are known.

20 Eidhammer et al. Protein Bioinformatics Chapter 4 20 Align by use of dynamic programming
– Dynamic programming finds the best alignment of k sequences under a given scoring scheme
– For two sequences there are three different column types
– For three sequences there are seven different column types (x means an amino acid, - a blank):
  Sequence1  x - x x - - x
  Sequence2  x x - x - x -
  Sequence3  x x x - x - x
– Time complexity is O(n^k) for sequences of length n
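The column counts generalise: each of the k rows holds either a residue or a blank, and the all-blank column is not allowed, which leaves 2^k - 1 column types. A short sketch that enumerates them (the function name `column_types` is illustrative):

```python
from itertools import product

def column_types(k):
    """All ways to fill one alignment column for k sequences.

    'x' stands for a residue and '-' for a blank; the all-blank
    column is excluded, leaving 2**k - 1 types.
    """
    return [c for c in product("x-", repeat=k) if "x" in c]

for k in (2, 3, 4):
    print(k, len(column_types(k)))  # prints 2 3, then 3 7, then 4 15
```

In a straightforward k-dimensional dynamic programming table, each cell has one predecessor per column type, so the number of column types is a per-cell factor on top of the roughly n^k cells.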

22 Eidhammer et al. Protein Bioinformatics Chapter 4 22 Use of dynamic programming Dynamic programming finds best alignment of k sequences given scoring scheme

24 Eidhammer et al. Protein Bioinformatics Chapter 4 24 Algorithm for dynamic programming

25 Eidhammer et al. Protein Bioinformatics Chapter 4 25 Connection between alignment and evolutionary tree
Consider the set of sequences ARL, ARTL, ARSI, ARSL, AWTL, AWT
Alignment:
  AR-L
  ARTL
  ARSI
  ARSL
  AWTL
  AWT-
Possible tree: [Figure: tree relating the six sequences, with nodes labelled by the aligned sequences AWTL, ARTL, ARSL, AWT-, AR-L and ARSI]
Use the tree to calculate the alignment

26 Eidhammer et al. Protein Bioinformatics Chapter 4 26 Phylogenetic studies
The purpose of phylogenetic studies of related objects is to
– reconstruct the correct genealogical ties between them (the topology), and
– estimate the time of divergence between them since they last shared a common ancestor (the length of the edges in the tree).
In phylogenetic studies, the objects are often referred to as operational taxonomic units (OTUs). In our case the objects are protein or nucleic acid sequences. We refer to the set of sequences we have at the start as the original sequences.

27 Eidhammer et al. Protein Bioinformatics Chapter 4 27 Phylogenetic studies

28 Eidhammer et al. Protein Bioinformatics Chapter 4 28 Example

29 Eidhammer et al. Protein Bioinformatics Chapter 4 29 Number of different tree topologies

30 Eidhammer et al. Protein Bioinformatics Chapter 4 30 Additive tree

31 Eidhammer et al. Protein Bioinformatics Chapter 4 31 Additive and ultrametric
Lemma 1: It is possible to construct an additive tree from the distances between the sequences (metric space) if and only if any four of them can be labelled i, j, k, l such that $D_{i,j} + D_{k,l} = D_{i,k} + D_{j,l} \ge D_{i,l} + D_{j,k}$.
Lemma 2: It is possible to construct an ultrametric tree from the distances between the sequences (metric space) if and only if for every i, j, k: $D_{i,j} \le \max(D_{i,k}, D_{k,j})$.
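Both conditions can be tested directly on a distance matrix. The sketch below assumes a symmetric matrix given as a list of lists; the function names, the tolerance parameter and the two example matrices are illustrative. The condition of Lemma 1 is checked in the equivalent form "of the three pairings of four sequences, the two largest sums are equal".

```python
from itertools import combinations, permutations

def is_additive(D, tol=1e-9):
    """Four-point condition: among the three pairings of any four
    sequences, the two largest distance sums must be equal."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

def is_ultrametric(D, tol=1e-9):
    """Three-point condition: D[i][j] <= max(D[i][k], D[k][j]) for all triples."""
    return all(D[i][j] <= max(D[i][k], D[k][j]) + tol
               for i, j, k in permutations(range(len(D)), 3))

# Distances read off a tree ((A,B),(C,D)) with leaf edges 1, 2, 3, 4
# and an internal edge of length 5: additive but not ultrametric.
D_add = [[0, 3, 9, 10], [3, 0, 10, 11], [9, 10, 0, 7], [10, 11, 7, 0]]
# Distances from a clock-like tree (((A,B),C),D): additive and ultrametric.
D_ultra = [[0, 2, 6, 8], [2, 0, 6, 8], [6, 6, 0, 8], [8, 8, 8, 0]]

print(is_additive(D_add), is_ultrametric(D_add))      # True False
print(is_additive(D_ultra), is_ultrametric(D_ultra))  # True True
```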

32 Eidhammer et al. Protein Bioinformatics Chapter 4 32 Maximum parsimony

33 Eidhammer et al. Protein Bioinformatics Chapter 4 33 Pairwise grouping

34 Eidhammer et al. Protein Bioinformatics Chapter 4 34 An example

35 Eidhammer et al. Protein Bioinformatics Chapter 4 35 Neighbour joining

37 Eidhammer et al. Protein Bioinformatics Chapter 4 37 Bootstrapping

39 Eidhammer et al. Protein Bioinformatics Chapter 4 39 General progressive alignment
Algorithm 4.3. General progressive alignment.
Progressive alignment of the sequences {s_1, s_2, ..., s_m}
var
  C    current set of alignments
begin
  C := ∅
  for i := 1 to m do C := C union {{s_i}} end    # one alignment for each sequence
  for i := 1 to m − 1 do
    choose two alignments A_p, A_q from C
    C := C − {A_p, A_q}
    A_r := align(A_p, A_q)
    C := C union {A_r}
  end
  # C now contains the (single) final alignment
end
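The control flow of Algorithm 4.3 can be sketched in Python with the two open choices passed in as functions. This is only an illustration of the loop: `choose` and `align` are caller-supplied, and the `pad_align` used here is a deliberately trivial placeholder that right-pads rows with gaps instead of computing a real alignment. All names are assumptions, and the input is assumed to be non-empty.

```python
def progressive_alignment(sequences, choose, align):
    """Control loop of general progressive alignment.

    choose(C) returns the indices (p, q) of two alignments in C to merge;
    align(A_p, A_q) returns their merged alignment. Both are supplied by
    the caller.
    """
    C = [[s] for s in sequences]          # one alignment per sequence
    for _ in range(len(sequences) - 1):   # m - 1 merges
        p, q = choose(C)
        A_p, A_q = C[p], C[q]
        C = [A for i, A in enumerate(C) if i not in (p, q)]
        C.append(align(A_p, A_q))
    return C[0]                           # the single final alignment

def pad_align(A, B):
    """Placeholder for align(): right-pads all rows with '-' to a common length."""
    width = max(len(r) for r in A + B)
    return [r.ljust(width, "-") for r in A + B]

result = progressive_alignment(["ARL", "ARTL", "AWT"],
                               choose=lambda C: (0, 1),
                               align=pad_align)
print(result)
```

In a real method, `choose` would follow a clustering rule or guide tree and `align` would be a dynamic programming alignment of two alignments.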

43 Eidhammer et al. Protein Bioinformatics Chapter 4 43 Clustering philosophy
Join the two groups with the highest pairwise score.
1. Average scoring method: find the average score over all pairs in the two groups
2. Maximum scoring method: find the maximum score over all pairs in the two groups (needs only one high-scoring pair)
3. Minimum (complete) scoring method: find the minimum score over all pairs (all pairs are taken into account)
4. Special scoring method
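The first three scoring methods differ only in how the pairwise scores between the two groups are combined. A minimal sketch, assuming that "pairs" means one member from each group and that pairwise scores are stored in a matrix indexed by sequence number; the function name and the score values are hypothetical:

```python
from itertools import product
from statistics import mean

def group_score(g1, g2, score, method="average"):
    """Score between two groups, from the pairwise scores of their members."""
    pair_scores = [score[a][b] for a, b in product(g1, g2)]
    if method == "average":
        return mean(pair_scores)
    if method == "maximum":
        return max(pair_scores)
    if method == "minimum":
        return min(pair_scores)
    raise ValueError(f"unknown method {method!r}")

# Hypothetical symmetric pairwise score matrix for four sequences.
score = [[0, 9, 10, 4],
         [9, 0, 6, 8],
         [10, 6, 0, 7],
         [4, 8, 7, 0]]

for method in ("average", "maximum", "minimum"):
    print(method, group_score([0, 1], [2, 3], score, method))
```

For the groups {0, 1} and {2, 3} the pairwise scores are 10, 4, 6 and 8, so the three methods give 7, 10 and 4 respectively.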

45 Eidhammer et al. Protein Bioinformatics Chapter 4 45 The Clustal Algorithm
Three steps:
1. Compare all pairs of sequences to obtain a similarity matrix
2. Based on the similarity matrix, make a guide tree relating all the sequences
3. Perform progressive alignment where the order of the alignments is determined by the guide tree
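Step 1 can be illustrated with a small all-against-all comparison. The sketch below is not Clustal's own pairwise method: it uses a plain Needleman-Wunsch global alignment score with made-up unit costs (match +1, mismatch -1, gap -1) purely to show how a similarity matrix is filled.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap cost."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def similarity_matrix(seqs):
    """Symmetric matrix of pairwise global alignment scores."""
    n = len(seqs)
    S = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            S[i][j] = S[j][i] = nw_score(seqs[i], seqs[j])
    return S

seqs = ["ARL", "ARTL", "ARSI", "ARSL", "AWTL", "AWT"]
for row in similarity_matrix(seqs):
    print(row)
```

A guide tree for step 2 would then be built by clustering on this matrix, and step 3 would merge alignments in the order given by the tree.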

46 Eidhammer et al. Protein Bioinformatics Chapter 4 46
[Figure: (A) 1. pairwise comparison, 2. clustering/making tree; (B) 3. align according to tree]

47 Eidhammer et al. Protein Bioinformatics Chapter 4 47 ClustalW - Score of aligning two alignment columns
– sum the score matrix entries for all pairs of residues
– weight each pair by the sequences’ weights
  1: peeksavtal
  2: geekaavlal
  3: egewglvlhv
  4: aaektkirsa
Score: M(t,v) + M(t,i) + M(l,v) + M(l,i)

48 Eidhammer et al. Protein Bioinformatics Chapter 4 48 ClustalW - Weighting sequences
– each sequence is given a weight
– groups of related sequences receive lower weight
  1: peeksavtal
  2: geekaavlal
  3: egewglvlhv
  4: aaektkirsa
Weighted score: w1*w3*M(t,v) + w1*w4*M(t,i) + w2*w3*M(l,v) + w2*w4*M(l,i)
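The column score is a double sum over the residues of the two columns, with each term weighted by the product of the two sequence weights. The sketch below assumes gap-free columns; the function name, the toy score table `toy_M` and the weights are illustrative values, not ClustalW's actual matrix entries or weights.

```python
def column_score(col_a, col_b, M, w_a=None, w_b=None):
    """Sum M(x, y) over all residue pairs (x from col_a, y from col_b),
    each pair weighted by the product of the two sequence weights.
    Without weights, every sequence counts with weight 1."""
    w_a = w_a or [1.0] * len(col_a)
    w_b = w_b or [1.0] * len(col_b)
    return sum(wa * wb * M(x, y)
               for x, wa in zip(col_a, w_a)
               for y, wb in zip(col_b, w_b))

# Illustrative scores for the four residue pairs in the example.
toy_M = {("t", "v"): 0, ("t", "i"): -1, ("l", "v"): 1, ("l", "i"): 2}
M = lambda x, y: toy_M[(x, y)]

# Column (t, l) from sequences 1-2 against column (v, i) from sequences 3-4.
print(column_score("tl", "vi", M))                                    # unweighted
print(column_score("tl", "vi", M, w_a=[0.4, 0.6], w_b=[1.0, 0.5]))    # weighted
```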

49 Eidhammer et al. Protein Bioinformatics Chapter 4 49 ClustalW - Similarity matrix
The distance between the sequences (measured from the guide tree) determines which matrix to use
– 80-100% seq-id -> use Blosum80
– 60-80% seq-id -> Blosum60
– 30-60% seq-id -> Blosum45
– 0-30% seq-id -> Blosum30
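The selection rule is a simple threshold lookup. The sketch below mirrors the ranges and matrix names exactly as listed on the slide; how the shared boundary values (80, 60, 30) are assigned is not stated there, so placing them in the higher-identity band is an assumption, as is the function name.

```python
def choose_matrix(percent_identity):
    """Pick a substitution matrix name from the percent sequence identity.

    Boundary values are assigned to the higher-identity band (assumption).
    """
    if not 0 <= percent_identity <= 100:
        raise ValueError("percent identity must be between 0 and 100")
    if percent_identity >= 80:
        return "Blosum80"
    if percent_identity >= 60:
        return "Blosum60"
    if percent_identity >= 30:
        return "Blosum45"
    return "Blosum30"

for pid in (95, 70, 45, 10):
    print(pid, choose_matrix(pid))
```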

50 Eidhammer et al. Protein Bioinformatics Chapter 4 50 ClustalW - Gap penalties
– Initial gap penalty: GOP
– Gap extension penalty: GEP
  GTEAKLIVLMANE
  GA---------KL
Penalty: GOP + 8*GEP
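In the example the gap is nine positions long, so the slide's penalty corresponds to charging GOP for the first gap position and GEP for each of the remaining eight. A minimal sketch of that convention follows; the function name and the numeric values of GOP and GEP are placeholders, not ClustalW defaults.

```python
import re

def gap_penalty(row, gop, gep):
    """Affine penalty for one aligned row: each run of '-' costs
    GOP for its first position plus GEP for every further position."""
    return sum(gop + (len(run) - 1) * gep for run in re.findall(r"-+", row))

# Placeholder values: GOP = 10, GEP = 1, giving 10 + 8*1.
print(gap_penalty("GA---------KL", gop=10, gep=1))  # 18
```

Other formulations charge GEP for every gap position, which would give GOP + 9*GEP for the same row; the sketch follows the slide's count.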

51 Eidhammer et al. Protein Bioinformatics Chapter 4 51 ClustalW - Modifications of gap penalty
Position-specific penalty
– gap at position?
    yes -> lower GOP
    no, but gap within 8 residues -> increase GOP
– hydrophilic residues -> lower GOP

52 Eidhammer et al. Protein Bioinformatics Chapter 4 52 Globin alignment Default gap penalty GEP=0.05

53 Eidhammer et al. Protein Bioinformatics Chapter 4 53 Globin alignment - with insert Default gap penalty GEP=0.05

54 Eidhammer et al. Protein Bioinformatics Chapter 4 54 Globin alignment - with insert Lowered gap penalty GEP=0.01

55 Eidhammer et al. Protein Bioinformatics Chapter 4 55 ClustalW - summary
– Does not use a score for the final alignment
– Each pairwise alignment is done using dynamic programming
– Heuristics (e.g., gap-penalty modifications) are used, tailored to globular proteins
– Graphical version: ClustalX

56 Eidhammer et al. Protein Bioinformatics Chapter 4 56 SAGA: Sequence Alignment by Genetic Algorithm
– An “objective function” is used to score the alignments
– An alignment is represented as a bit string
– A population of alignments is “evolved”
– Alignments can be combined (crossover)
– Alignments can be mutated
– Alignments with higher scores are more likely to be chosen for mating/survival

58 Eidhammer et al. Protein Bioinformatics Chapter 4 58 Evaluation of Alignment Methods
– Align a set of protein sequences for which the structures are known (at least for some proteins)
– Align the protein structures
– Identify “motifs” from the structure alignment
– Check whether the sequence alignment has correctly aligned the motifs
McClure et al., 1994; Thompson et al., 1999

59 Eidhammer et al. Protein Bioinformatics Chapter 4 59 Alignments are important
Basis for other analyses
– structure prediction
– phylogeny
– experiments: PCR primer identification, site-directed mutagenesis, ...
– identification of motifs

60 Eidhammer et al. Protein Bioinformatics Chapter 4 60 Open problems - room for improvement!
– Good scoring function for alignments (identify well-aligned regions)
– Efficient algorithms
– Resolving repeat structure, domain movements etc.
– Incorporating external information

61 Eidhammer et al. Protein Bioinformatics Chapter 4 61 Future development
More sequences
– more families, but not so many
– more densely populated families
– “easier” alignment problem
– identify more ancient relationships (superfamilies)
More structures
– more sequences can be “threaded”
– alignments help

