
1 1 Multiple sequence alignment Lesson 3

2 2 1. What is a multiple sequence alignment?

3 3 Multiple sequence alignment. Similar to pairwise alignment, BUT n sequences are aligned instead of just n=2.
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

4 4 Multiple sequence alignment. MSA = Multiple Sequence Alignment. Each row represents an individual sequence. Each column represents the ‘same’ position.
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

5 5 Homo sapiens, Pan troglodytes, Mus musculus, Canis familiaris, Gallus gallus, Anopheles gambiae, Drosophila melanogaster, Caenorhabditis elegans, Arabidopsis thaliana, Rattus norvegicus

6 6 Histone H4 protein

7 7 Multiple sequence alignment. Examples: NADH dehydrogenase subunit 4; Histone H4 protein. ► Which is better: a pairwise alignment, or a pair of rows taken from an MSA?

8 8 2. How MSAs are computed

9 9 Alignment – Dynamic Programming. There is a dynamic programming algorithm for n sequences, similar to the one used for pairwise alignment. Complexity: O(L^n) for n sequences of length L, i.e. exponential in the number of sequences.

10 10 Alignment methods. This complexity is not practical, therefore heuristics are used: progressive/hierarchical alignment (Clustal); iterative alignment (MAFFT, MUSCLE).

11 11 Progressive alignment, first step: compute the pairwise alignments for all against all (10 pairwise alignments for the five sequences A, B, C, D, E). The similarities are converted to distances and stored in a table. [Figure: lower-triangular pairwise distance table for sequences A to E]

12 12 Second step: cluster the sequences to create a tree (guide tree). The guide tree represents the order in which pairs of sequences are to be aligned; similar sequences are neighbors in the tree; distant sequences are distant from each other in the tree. The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences! [Figure: pairwise distance table and the guide tree built from it, with leaves A, D, C, B, E]
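The clustering step can be sketched as follows. This is a minimal illustration that assumes UPGMA (average linkage) as the clustering method, since the slides do not name one; the function name `upgma` and the four-sequence distance table are hypothetical and are not taken from the slide.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a guide tree by average-linkage clustering (UPGMA).

    names: list of sequence labels.
    dist:  dict mapping frozenset({a, b}) -> distance, for every pair.
    Returns nested tuples; the nesting gives the order of alignment.
    """
    # Each cluster maps its label to (subtree, number of sequences in it).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sorted(clusters), 2),
                   key=lambda pair: d[frozenset(pair)])
        (tree_a, n_a), (tree_b, n_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                n_a * d[frozenset((a, c))] + n_b * d[frozenset((b, c))]
            ) / (n_a + n_b)
        clusters[merged] = ((tree_a, tree_b), n_a + n_b)
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical distances (not the values from the slide).
toy = {frozenset(p): v for p, v in {
    ("A", "B"): 2, ("C", "D"): 4,
    ("A", "C"): 10, ("A", "D"): 10, ("B", "C"): 10, ("B", "D"): 10,
}.items()}
guide_tree = upgma(["A", "B", "C", "D"], toy)
print(guide_tree)  # (('A', 'B'), ('C', 'D'))
```

Reading the result from the inside out gives the alignment order: A with B, C with D, then the two resulting profiles with each other.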

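13 13 Third step: 1. Align the most similar (neighboring) pairs of sequences. [Figure: guide tree with leaves A, D, C, B, E; sequence-to-sequence alignments at the tips]

14 14 Third step: 2. Align pairs of pairs (sequence to profile, or profile to profile). [Figure: guide tree with leaves A, D, C, B, E]

15 15 Third step. Main disadvantages: sub-optimal tree topology; misalignments resulting from globally aligning pairs of sequences. [Figure: guide tree with leaves A, D, C, B, E]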

16 16 Iterative alignment. Cycle through: pairwise distance table, guide tree, MSA. Iterate until the MSA does not change (convergence). [Figure: sequences A to E, distance table, guide tree and MSA connected in a loop]

17 17 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

18 18 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

19 19 Consensus sequence. A consensus sequence holds the most frequent character of the alignment at each column. Alignment: TGTTCTA, TGTTCAA, TCTTCAA. Consensus: TGTTCAA.

20 20 Consensus sequence – an example. The -10 region of six promoters: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. There are many variants to the “consensus”.

21 21 Consensus sequence – an example. 1. Strict majority.* Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Consensus: TAATAT. * In case of equal frequencies, choose one according to alphabetical order.
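The strict-majority rule, including the alphabetical tie-break, can be sketched in a few lines. The function name `consensus` is only illustrative.

```python
from collections import Counter

def consensus(alignment):
    """Most frequent character per column; ties go to the alphabetically first."""
    result = []
    for column in zip(*alignment):
        counts = Counter(column)
        best = max(counts.values())
        # min() implements the alphabetical tie-break among the top characters.
        result.append(min(ch for ch, n in counts.items() if n == best))
    return "".join(result)

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]
print(consensus(promoters))  # TAATAT
```

Column 3 holds three A's and three G's, so it is the one place where the tie-break decides the outcome.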

22 22 Consensus sequence – an example. Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Consensus: TAATAT. Had we searched the region upstream of genes for this consensus, we would have identified only 2 out of the 6 sequences, so we would miss many cases. By chance, we expect a “hit” every 4,096 bp (4^6).

23 23 Consensus sequence – an example. Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Consensus: TAATAT. We can search while allowing 1 mismatch. We would then have identified 3 out of the 6 sequences, so we would miss fewer cases. By chance, we expect a “hit” every ~200 bp → more “noise”.

24 24 Consensus sequence – an example. Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Consensus: TAATAT. We can search while allowing 2 mismatches. We would then have identified all 6 sequences, so we would not miss any. By chance, we expect a “hit” every ~30 bp → A LOT OF “noise”.
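The hit counts and the chance frequencies on the last three slides can be reproduced with a short sketch. It assumes a random background in which the four bases are independent and equally likely; the helper names are illustrative.

```python
from math import comb

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]

def mismatches(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def hits(seqs, consensus, max_mismatches):
    """How many sequences lie within max_mismatches of the consensus."""
    return sum(mismatches(s, consensus) <= max_mismatches for s in seqs)

def expected_spacing(length, max_mismatches, alphabet=4):
    """Average distance in bp between chance hits in uniform random DNA."""
    # Count the words within max_mismatches of a fixed word of this length.
    n_words = sum(comb(length, i) * (alphabet - 1) ** i
                  for i in range(max_mismatches + 1))
    return alphabet ** length / n_words

for k in range(3):
    print(k, hits(promoters, "TAATAT", k), round(expected_spacing(6, k)))
# 0 2 4096
# 1 3 216
# 2 6 27
```

The spacings of 216 bp and 27 bp are what the slides round to ~200 bp and ~30 bp.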

25 25 Consensus sequence – an example. 2. Majority only when it is a clear case. In the remaining cases, use wildcards: Y = pyrimidine, R = purine, N = any nucleotide. Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Consensus: TNRTAT.

26 26 Reminder: Purines & Pyrimidines. Y = pyrimidine (C or T). R = purine (A or G). N = any nucleotide.

27 27 Consensus sequence – an example. Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Consensus: TNRTAT. Had we searched the region upstream of genes with the degenerate consensus, we would have identified 4/6 sequences. By chance, we expect a “hit” every ~500 bp.
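A degenerate consensus can be matched by translating each IUPAC code into a character class. The sketch below covers only the codes used on these slides, and again assumes uniform, independent bases for the chance estimate; the names `IUPAC`, `to_regex` and `chance_of_match` are illustrative.

```python
import re

# Only the IUPAC codes that appear on the slides.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "Y": "CT", "N": "ACGT"}

def to_regex(degenerate):
    """Turn a degenerate consensus such as TNRTAT into a regular expression."""
    return "".join(f"[{IUPAC[code]}]" for code in degenerate)

def chance_of_match(degenerate):
    """Probability that a random word matches, for uniform independent bases."""
    p = 1.0
    for code in degenerate:
        p *= len(IUPAC[code]) / 4
    return p

promoters = ["TAGCAT", "TAATAT", "TAATAT", "TCATAG", "TAGTAT", "TTGTAT"]
pattern = re.compile(to_regex("TNRTAT"))
matched = [s for s in promoters if pattern.fullmatch(s)]
print(len(matched), 1 / chance_of_match("TNRTAT"))  # 4 512.0
```

One chance match per 512 bp is the value the slide rounds to ~500 bp.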

28 28 Consensus sequence – an example. There is always a tradeoff between sensitivity and specificity. Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN). Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP). Compare the degenerate consensus TNRTAT with the strict consensus TAATAT.

29 29 Consensus sequence – an example. Sensitivity: the fraction of actual positives that are predicted as positive, TP / (TP + FN). Specificity: the fraction of actual negatives that are predicted as negative, TN / (TN + FP). Permissive consensus: higher sensitivity, lower specificity (more true positives, more false positives ↔ fewer true negatives, fewer false negatives). Nonpermissive consensus: higher specificity, lower sensitivity (fewer true positives, fewer false positives ↔ more true negatives, more false negatives).
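The tradeoff can be made concrete with the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). In the sketch below the true-positive and false-negative counts follow the promoter example (2 of 6 found with the strict consensus, 6 of 6 when 2 mismatches are allowed), while the counts for non-promoter sites are purely hypothetical.

```python
def sensitivity(tp, fn):
    """Fraction of real sites that the search finds."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of non-sites that the search correctly leaves alone."""
    return tn / (tn + fp)

# tn and fp are hypothetical numbers chosen only for illustration.
strict = {"tp": 2, "fn": 4, "tn": 990, "fp": 10}
permissive = {"tp": 6, "fn": 0, "tn": 800, "fp": 200}

for name, c in (("strict", strict), ("permissive", permissive)):
    print(name,
          round(sensitivity(c["tp"], c["fn"]), 2),
          round(specificity(c["tn"], c["fp"]), 2))
# strict 0.33 0.99
# permissive 1.0 0.8
```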

30 30 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

31 31 Patterns. Sequences: TAGCAT TAATAT TAATAT TCATAG TAGTAT TTGTAT. Pattern: T-[ACT]-[AG]-[CT]-A-[GT]. Patterns are more informative than consensus sequences. A pattern specifies, for each position, the characters that are possible at that position.

32 32 Patterns - syntax. The standard IUPAC one-letter codes are used. ‘x’ : any amino acid. ‘[]’ : residues allowed at the position. ‘{}’ : residues forbidden at the position. ‘()’ : repetition of a pattern element is indicated in parentheses: x(n) or x(n,m) gives the number or range of repetitions. ‘-’ : separates each pattern element. ‘<’ : indicates an N-terminal restriction of the pattern. ‘>’ : indicates a C-terminal restriction of the pattern. ‘.’ : the period ends the pattern.

33 33 Patterns. W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]. x(9,11): any amino acid, 9 to 11 times. [FYV]: F or Y or V. Examples: WOPLASDFGYVWPPPLAWS (matches); ROPLASDFGYVWPPPLAWS (no match: the sequence does not start with W); WOPLASDFGYVWPPPLSQQQ (no match: no G, S, T, N or E at 6 to 7 positions after the [FYW] residue).
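Patterns in this syntax map almost directly onto regular expressions. The converter below is a minimal sketch that handles only the elements listed on the syntax slide; the function name `pattern_to_regex` is illustrative and the terminal anchors are assumed to be written as `<` and `>`.

```python
import re

def pattern_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Forbidden residues: {AB} becomes [^AB]. Must run before the
        # repetition step below, which introduces new braces.
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("x", ".")                     # any amino acid
        element = element.replace("(", "{").replace(")", "}")   # x(9,11) -> .{9,11}
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)

regex = pattern_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]")
print(regex)  # W.{9,11}[FYV][FYW].{6,7}[GSTNE]

examples = ["WOPLASDFGYVWPPPLAWS", "ROPLASDFGYVWPPPLAWS", "WOPLASDFGYVWPPPLSQQQ"]
print([bool(re.search(regex, s)) for s in examples])  # [True, False, False]
```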

34 34 3. MSA – What is it good for? A. Conserved positions B. Consensus C. Patterns D. Profiles E. Much more…

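35 35 Profile = PSSM = Position Specific Score Matrix. Aligned sequences: ACCCAA, AACCGG, AACCTT. Frequency of each nucleotide at each position:
     1     2     3     4     5     6
A    1     0.67  0     0     0.33  0.33
C    0     0.33  1     1     0     0
G    0     0     0     0     0.33  0.33
T    0     0     0     0     0.33  0.33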

36 36 Profiles / PSSMs. Using the PSSM of slide 35: P(AACCAA) = 1 × 0.67 × 1 × 1 × 0.33 × 0.33 ≈ 0.07; P(GACCAA) = 0. Sequences with higher probabilities → higher chance of being related to the PSSM.

37 37 Searching with PSSM. One compares each n-mer to the profile and computes the probabilities. Sequences with probabilities > threshold are considered as hits. Example: for GACGGTACGTAGCGGAGCGACCAA, start by computing the probability of the first 6-mer under the PSSM of slide 35.

38 38 Searching with PSSM. Slide a 6-letter window along GACGGTACGTAGCGGAGCGACCAA and compute the probabilities P1, P2, P3, P4, … of the successive 6-mers under the PSSM of slide 35. 6-mers with probabilities > threshold are considered as hits.
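Building a PSSM and scanning with it can be sketched as below. The training alignment (ACCCAA, AACCGG, AACCTT) is an assumption: it is the set of three sequences whose column frequencies reproduce P(AACCAA) = 1 × 0.67 × 1 × 1 × 0.33 × 0.33 quoted on slide 36. The scanned sequence `GGAACCAAGG` and the threshold 0.05 are hypothetical, and no pseudocounts are used, so any unseen base gives probability 0.

```python
def build_pssm(seqs, alphabet="ACGT"):
    """One dict per column, giving the frequency of each base in that column."""
    n = len(seqs)
    return [{base: column.count(base) / n for base in alphabet}
            for column in zip(*seqs)]

def probability(pssm, word):
    """Product of the per-position frequencies of the word's bases."""
    p = 1.0
    for column, base in zip(pssm, word):
        p *= column[base]
    return p

def scan(pssm, sequence, threshold):
    """Return (offset, word) for every window scoring above the threshold."""
    width = len(pssm)
    windows = ((i, sequence[i:i + width])
               for i in range(len(sequence) - width + 1))
    return [(i, w) for i, w in windows if probability(pssm, w) > threshold]

pssm = build_pssm(["ACCCAA", "AACCGG", "AACCTT"])
print(round(probability(pssm, "AACCAA"), 3))  # 0.074
print(probability(pssm, "GACCAA"))            # 0.0
print(scan(pssm, "GGAACCAAGG", 0.05))         # [(2, 'AACCAA')]
```

The exact value is 2/27 ≈ 0.074; the slide's product of rounded entries comes to about 0.073.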

39 39 Profile-pattern-consensus. Multiple alignment: AACTTG, AAGTCG, CACTTC. Consensus: AACTTG. Degenerate consensus: NANTNN. Pattern: [AC]-A-[GC]-T-[TC]-[GC]. Profile:
     1     2     3     4     5     6
A    0.67  1     0     0     0     0
C    0.33  0     0.67  0     0.33  0.33
G    0     0     0.33  0     0     0.67
T    0     0     0     1     0.67  0

40 40 4. HMM: Hidden Markov Models

41 41 Definitions & Uses. An HMM is a probabilistic model which deals with sequences of symbols. Uses: inferring hidden states. Originally used in speech recognition (the symbols being phonemes). Useful in biology, where the sequence of symbols is a DNA or protein sequence.

42 42 Markov Chains. A sequence of random variables X_1, X_2, … where each present state depends only on the previous state. Weather example: the weather on day x depends only on the weather on day x-1. We can easily compute the probability of: Sunny → Sunny → Rainy → Sunny → Sunny.

43 43 Markov Chains. Similarly, we can assume a DNA sequence is Markovian: A → C → G → G → T → A… (vertical or horizontal!). These conditional probabilities can be illustrated as a graph over the four nucleotides A, C, G, T in which each arrow has a transition probability, e.g. $P_{CA} = P(x_i = A \mid x_{i-1} = C)$. Thus the probability of a sequence x of length L is $P(x) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1})$.
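For a first-order Markov chain the probability of a path is the probability of its first state multiplied by one transition probability per step. The sketch below applies this to the weather path Sunny → Sunny → Rainy → Sunny → Sunny; the initial and transition probabilities are hypothetical, since the values on the slide's diagram are not in the transcript.

```python
def chain_probability(path, initial, transition):
    """P(path) = P(first state) * product of P(next state | current state)."""
    p = initial[path[0]]
    for current, following in zip(path, path[1:]):
        p *= transition[current][following]
    return p

# Hypothetical parameters, chosen only for illustration.
initial = {"Sunny": 0.5, "Rainy": 0.5}
transition = {
    "Sunny": {"Sunny": 0.8, "Rainy": 0.2},
    "Rainy": {"Sunny": 0.4, "Rainy": 0.6},
}

path = ["Sunny", "Sunny", "Rainy", "Sunny", "Sunny"]
p = chain_probability(path, initial, transition)
print(round(p, 4))  # 0.0256, i.e. 0.5 * 0.8 * 0.2 * 0.4 * 0.8
```

The same function works for DNA by using the states A, C, G, T and a 4 × 4 transition table.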

44 44 Hidden Markov Models. The state sequence itself follows a simple Markov chain. But in an HMM it is no longer possible to know the state by looking at the symbols: the state is hidden. [Figure: chain of hidden states S_1 … S_{i-1}, S_i, S_{i+1} … S_n linked by transition probabilities P, each state emitting an observation K_1 … K_n with emission probability B]

45 45 The weather HMM example In this weather example only the actions are observable and the weather is hidden:

46 46 HMM formalities. An HMM is defined by {S, K, Π, P, B}. S = {s_1 … s_N} are the values for the hidden states. K = {k_1 … k_M} are the values for the observations. The hidden states emit/generate the symbols (observations). Π = {Π_i} are the initial state probabilities. P = {P_ij} are the state transition probabilities. B = {b_ik} are the emission probabilities. [Figure: chain of hidden states S_1 … S_n linked by P, each emitting an observation K_1 … K_n through B]

47 47 Another HMM example – the dishonest casino. In a casino, they use a fair die most of the time, but occasionally switch to an unfair die. The switch between dice can be represented by an HMM with two states. FAIR emits 1: 1/6, 2: 1/6, 3: 1/6, 4: 1/6, 5: 1/6, 6: 1/6. UNFAIR emits 1: 1/10, 2: 1/10, 3: 1/10, 4: 1/10, 5: 1/10, 6: 1/2. Transition probabilities shown on the diagram: 0.95, 0.05, 0.9, 0.1.

48 48 Dishonest casino - continued. The symbols (observations) are the sequence of rolls: 3 5 6 2 1 4 6 3 6… What is hidden? Whether the die is fair or unfair: f f f f u u u f f. This hidden state sequence is a Markov chain. In addition, we have emission probabilities: given a state, there are 6 possible symbols, each with an emission probability (FAIR: 1/6 for every face; UNFAIR: 1/10 for each of the faces 1 to 5 and 1/2 for a 6).
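Inferring the hidden fair/unfair sequence from the rolls is usually done with the Viterbi algorithm, which the slides do not spell out; the sketch below is one minimal version. Two things are assumptions: the assignment of the four transition probabilities to arrows (fair stays fair with 0.95, unfair stays unfair with 0.9) and the 50/50 initial state probabilities, which are not given on the slide.

```python
from math import log

STATES = ("F", "U")                      # fair, unfair
START = {"F": 0.5, "U": 0.5}             # assumption: not given on the slide
TRANS = {"F": {"F": 0.95, "U": 0.05},    # assumed assignment of the arrows
         "U": {"F": 0.10, "U": 0.90}}
EMIT = {"F": {face: 1 / 6 for face in range(1, 7)},
        "U": {**{face: 1 / 10 for face in range(1, 6)}, 6: 1 / 2}}

def viterbi(rolls):
    """Most probable hidden state path for a list of die rolls (log space)."""
    score = {s: log(START[s]) + log(EMIT[s][rolls[0]]) for s in STATES}
    backpointers = []
    for roll in rolls[1:]:
        pointers, new_score = {}, {}
        for s in STATES:
            # Best predecessor for landing in state s at this step.
            prev = max(STATES, key=lambda q: score[q] + log(TRANS[q][s]))
            pointers[s] = prev
            new_score[s] = score[prev] + log(TRANS[prev][s]) + log(EMIT[s][roll])
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))

print(viterbi([6] * 20))             # UUUUUUUUUUUUUUUUUUUU
print(viterbi([1, 2, 3, 4, 5] * 2))  # FFFFFFFFFF
```

A long run of sixes is best explained by the unfair die, while rolls without any six are best explained by the fair one.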

49 49 HMM of MSA. An MSA can be represented by an HMM with three kinds of states: insertion (of A/C/G/T); match or mismatch; deletion.

50 50 HMM of MSA. An MSA can be represented by an HMM with three kinds of states: insertion (of A/C/G/T); match or mismatch; deletion.

51 51 HMM of MSA can get more complex…

52 52 Questions where HMMs are used: Does this sequence belong to a particular family? Can we identify regions in a sequence (for instance alpha helices or beta sheets)? Pairwise/multiple sequence alignment. Searching databases for protein families (building profiles).

