Sequence Similarity. PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins

1 Sequence Similarity

2 PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins
[Figure: pair-HMM state diagram with states MATCH, INSERT X and INSERT Y]

3 A pair-HMM model of pairwise alignment
- Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
- Transition probabilities ~ gap penalties
- Emission probabilities ~ substitution matrix (from BLOSUM)
[Figure: pair-HMM with three states: MATCH emits an aligned pair ($x_i$, $y_j$), INSERT X emits $x_i$ against a gap, INSERT Y emits $y_j$ against a gap]
Example alignment:
x: ABRACA-DABRA
y: AB-ACARDI---

4 Computing Pairwise Alignments
The Viterbi algorithm
- The conditional distribution $P(\alpha \mid x, y)$ reflects the model's uncertainty over the "correct" alignment of x and y
- Identifies the highest-probability alignment, $\alpha_{\text{viterbi}}$, in $O(L^2)$ time
Caveat: the most likely alignment is not necessarily the most accurate
- Alternative: find the alignment of maximum expected accuracy
[Figure labels: $P(\alpha)$, $P(\alpha \mid x, y)$, $\alpha_{\text{viterbi}}$]

5 The Lazy-Teacher Analogy
10 students take a 10-question true-false quiz. How do you make the answer key?
- Approach #1: Use the answer sheet of the best student!
- Approach #2: Weighted majority vote!
[Figure: students with letter grades from A to C and their differing answers (T or F) to question 4]

6 Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi
- picks the single alignment with the highest chance of being completely correct
- mathematically, finds the alignment $\alpha$ that maximizes $E_{\alpha^*}[\mathbf{1}\{\alpha = \alpha^*\}]$
Maximum Expected Accuracy
- picks the alignment with the highest expected number of correct predictions
- mathematically, finds the alignment $\alpha$ that maximizes $E_{\alpha^*}[\text{accuracy}(\alpha, \alpha^*)]$
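The quiz analogy can be made concrete with a small sketch. The function names, the toy answer sheets, and the per-student weights below are illustrative assumptions and do not come from the slides; the weights stand in for how much the teacher trusts each student.

```python
from collections import Counter

def best_student_key(sheets, weights):
    """Approach #1 (Viterbi-like): copy the sheet of the highest-weighted student."""
    best = max(range(len(sheets)), key=lambda s: weights[s])
    return list(sheets[best])

def weighted_vote_key(sheets, weights):
    """Approach #2 (MEA-like): per question, take the answer with the largest total weight."""
    key = []
    for q in range(len(sheets[0])):
        tally = Counter()
        for sheet, w in zip(sheets, weights):
            tally[sheet[q]] += w  # each student votes with their weight
        key.append(max(tally, key=tally.get))
    return key
```

With three students who disagree on individual questions, the two keys can differ: the vote may overrule the best student on a question where the rest of the class agrees against them.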

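7 Computing MEA alignments
Define $\text{accuracy}(\alpha, \alpha^*) = \dfrac{\text{\# of correct predicted matches}}{\text{length of shorter sequence}}$
Then
$$
\begin{aligned}
E_{\alpha^*}[\text{accuracy}(\alpha, \alpha^*) \mid x, y]
&\propto E_{\alpha^*}\Big[\sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha^*\} \,\Big|\, x, y\Big] \\
&= \sum_{\alpha'} P(\alpha' \mid x, y) \sum_{(x_i, y_j) \in \alpha} \mathbf{1}\{(x_i, y_j) \in \alpha'\} \\
&= \sum_{(x_i, y_j) \in \alpha} \sum_{\alpha'} P(\alpha' \mid x, y)\, \mathbf{1}\{(x_i, y_j) \in \alpha'\} \\
&= \sum_{(x_i, y_j) \in \alpha} P\big((x_i, y_j) \in \alpha' \mid x, y\big)
\end{aligned}
$$
Define $M[i, j]$ = posterior probability that $x_i$ is aligned to $y_j$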

8 Computing MEA alignments
Define $\text{accuracy}(\alpha, \alpha^*) = \dfrac{\text{\# of correct predicted matches}}{\text{length of shorter sequence}}$
$M[i, j] = P(x_i \text{ is aligned to } y_j \mid x, y)$, the posterior probability that $x_i$ is aligned to $y_j$
- Can compute with forward, backward dynamic programming in $O(L^2)$ time
Then, the MEA alignment is the highest-summing path through the matrix $M$
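The "highest-summing path" step can be sketched as a Needleman-Wunsch-style dynamic program with no gap penalties. This is a minimal illustration, assuming the posterior matrix M is already available as a list of lists (computing it with forward-backward is not shown); the function name is hypothetical.

```python
def mea_alignment(M):
    """Return (score, pairs): the highest-summing monotone path through M.

    M[i][j] is assumed to hold the posterior probability that x_i aligns to y_j.
    pairs lists the matched (i, j) index pairs in increasing order.
    """
    n = len(M)
    m = len(M[0]) if n else 0
    # S[i][j] = best achievable sum using the first i residues of x and first j of y
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + M[i - 1][j - 1],  # match x_i with y_j
                          S[i - 1][j],                        # leave x_i unmatched
                          S[i][j - 1])                        # leave y_j unmatched
    # Traceback; gaps are preferred on ties so zero-probability matches are not reported
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if S[i][j] == S[i - 1][j]:
            i -= 1
        elif S[i][j] == S[i][j - 1]:
            j -= 1
        else:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
    pairs.reverse()
    return S[n][m], pairs
```

Like the forward-backward computation of M, this pass takes time proportional to the product of the two sequence lengths.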

10 The consistency signal
[Figure: three sequences x, y and z, with positions $x_i$, $y_j$, $y_{j'}$ and $z_k$ marked]

11 To estimate $P(x_i \sim y_j \mid x, y, z)$
Method 1: triplet-HMM
$P(x_i \sim y_j \mid x, y, z) = \sum_k P(x_i \sim y_j \sim z_k \mid x, y, z)$
- Parameters trained with unsupervised EM
- Running time: $O(N^3 L^3)$, where N is the number of sequences and L the sequence length

12 Probabilistic consistency
Compute $P(x_i \text{ is aligned to } y_j \mid x, y, z)$ in place of $P(x_i \text{ is aligned to } y_j \mid x, y)$
2 approaches:
- 1) Exact: triplet HMM, $O(L^3)$ time
- 2) Approximate: use independence assumptions
$$
\begin{aligned}
\sum_k P(x_i \sim z_k \text{ and } z_k \sim y_j \mid x, y, z)
&= \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid x, y, z, x_i \sim z_k) \\
&\approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y) \quad \text{(assume indep.)}
\end{aligned}
$$

13 Probabilistic consistency
To compute $P(x_i \text{ is aligned to } y_j \mid x, y, z)$:
$P(x_i \sim y_j \mid x, y, z) \approx \sum_k P(x_i \sim z_k \mid x, z)\, P(z_k \sim y_j \mid z, y)$
- Notice that for any given i, most entries k and j will be close to 0, so the matrices are sparse
- In matrix form: $P_{xy|z} \approx P_{xz} P_{zy}$
- Finally, let $P_{xy|S} \approx \frac{1}{|S|} \sum_{z \in S} P_{xz} P_{zy}$
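The averaged matrix product can be sketched in a few lines. This is an illustration only: it uses dense lists of lists rather than the sparse matrices the slide alludes to, the function names are hypothetical, and whether the set S also contains x and y themselves is left to the caller.

```python
def matmul(A, B):
    """Dense matrix product of two lists of lists."""
    inner, cols = len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(len(A))]

def consistency_transform(through_z):
    """Average P_xz * P_zy over all intermediate sequences z.

    through_z: non-empty list of (P_xz, P_zy) pairs, one per z in S, where
    P_xz[i][k] is the posterior that x_i aligns to z_k and P_zy[k][j]
    the posterior that z_k aligns to y_j.
    """
    total = None
    for P_xz, P_zy in through_z:
        prod = matmul(P_xz, P_zy)
        if total is None:
            total = prod
        else:
            total = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(total, prod)]
    size = len(through_z)
    return [[v / size for v in row] for row in total]
```

A production version would exploit sparsity by skipping entries of P_xz and P_zy that are close to 0.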

14 Multiple sequence alignment
A straightforward generalization
- sum-of-pairs
- tree-based progressive alignment
- iterative refinement
[Figure: pairwise alignments merged step by step into a multiple alignment, e.g.]
ABRACA-DABRA
AB-ACARDI---
ABRA---DABI-

16 Summary of PROBCONS Algorithm
Given K sequences to be aligned,
(1) Compute M[i, j] for all pairs of sequences, x and y
(2) Use probabilistic consistency to reestimate M[i, j]
(3) Build a tree of the sequences by connecting closest first
    "Closest" defined according to expected accuracy: EA(x, y) = E(accuracy) of the MEA alignment of x and y
(4) Perform progressive alignment along the tree
    Score of a column: sum-of-pairs M[i, j]
(5) Apply iterative refinement

17 Training/testing methodology
- 3 reference benchmark sets: BAliBASE, PREFAB, SABmark
- PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE
- Quality score: $Q(\alpha, \alpha^*) = \dfrac{\text{\# of correct predicted matches}}{\text{total \# of true matches}}$
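The quality score is a simple ratio and can be sketched as follows, assuming (for illustration only) that an alignment is represented as a collection of matched index pairs (i, j); the function name is hypothetical.

```python
def quality_score(predicted_pairs, true_pairs):
    """Q = (# of correct predicted matches) / (total # of true matches)."""
    true_set = set(true_pairs)
    if not true_set:
        raise ValueError("reference alignment has no matches")
    correct = len(set(predicted_pairs) & true_set)
    return correct / len(true_set)
```

Note that the denominator here is the number of true matches, whereas the accuracy used in the MEA derivation divides by the length of the shorter sequence.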

18 Evaluation of Algorithm Components
Algorithm                          Quality (74)   Time (sec)   Stage
Viterbi                            0.375          0.72         all-pairs pairwise
MEA                                0.403          1.6          all-pairs pairwise
PC (O(L^3))                        0.431          584.2        all-pairs pairwise
PC x 1 (O(L^2))                    0.422          1.7          all-pairs pairwise
PC x 2 (O(L^2))                    0.427          1.9          all-pairs pairwise
Progressive PC x 2 (O(L^2))        0.432          1.9          multiple
Progressive PC x 2 (O(L^2)) + IR   0.435          3.3          multiple

19 Performance of different alignment tools
Q = quality score, t = running time
Algorithm   BAliBASE (237)     PREFAB (1932)        SABmark (698)
            Q       t          Q       t            Q       t
Align-m     0.804   19:25      -       -            0.352   56:44
DIALIGN     0.832   2:53       0.572   12:25:00     0.410   8:28
CLUSTALW    0.861   1:07       0.589   2:57:00      0.439   2:16
MAFFT       0.882   1:18       0.648   2:36:00      0.442   7:33
T-Coffee    0.883   21:31      0.636   144:51:00    0.456   59:10
MUSCLE      0.896   1:05       0.648   3:11:00      0.464   20:42
PROBCONS    0.910   5:32       0.668   19:41:00     0.505   17:20

20 Resources for alignment
Protein Multiple Aligners
- CLUSTALW: most widely used (1994), http://www.ebi.ac.uk/clustalw/
- MUSCLE: most scalable (2004), http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
- PROBCONS: most accurate (2004), http://probcons.stanford.edu/
Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.
ProbCons: Chuong (Tom) Do

21 Profile hidden Markov models for sequence families

23 PFAM
Protein FAMilies database of alignments
Profile HMMs describe each family
For each family in Pfam you can:
- Look at multiple alignments
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures

24 PFAM
Pfam-A: curated multiple alignments
- Grows slowly; quality controlled by experts
Pfam-B: automatic clustering (ProDom derived)
- New sequences instantly incorporated; unchecked
Search by: sequence, keyword, domain, taxonomy
Browsing by family or genome
Evolutionary tree
Source of seed alignments:
- Pfam-B families
- Published articles
- 'Domain hunting' studies

31 Profile HMMs
- Each M state has a position-specific pre-computed substitution table
- Each I state has position-specific gap penalties (and in principle can have its own emission distributions)
- Each D state also has position-specific gap penalties
- In principle, D-D transitions can also be customized per position
[Figure: profile HMM for protein family F: BEGIN, match states $M_1 \dots M_m$, insert states $I_0 \dots I_m$, delete states $D_1 \dots D_m$, END]

32 Profile HMMs
- transition between match states: $\alpha_{M(i)M(i+1)}$
- transitions between match and insert states: $\alpha_{M(i)I(i)}$, $\alpha_{I(i)M(i+1)}$
- transition within insert state: $\alpha_{I(i)I(i)}$
- transitions between match and delete states: $\alpha_{M(i)D(i+1)}$, $\alpha_{D(i)M(i+1)}$
- transition within delete state: $\alpha_{D(i)D(i+1)}$
- emission of amino acid b at a state S: $\varepsilon_S(b)$

33 Profile HMMs
- transition probabilities ~ frequency of a transition in alignment
- emission probabilities ~ frequency of an emission in alignment
- pseudocounts are usually introduced
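As an illustration of frequency-based estimation with pseudocounts, the sketch below estimates the emission distribution of one match state from a single alignment column. The add-a-constant (Laplace-style) pseudocount, the gap character '-', and the function name are assumptions made for this example; the slides do not say which pseudocount scheme is used. Transition probabilities could be estimated the same way from transition counts.

```python
from collections import Counter

def emission_probs(column, alphabet, pseudocount=1.0):
    """Estimate emission probabilities for one match state.

    column: the residues observed in one alignment column ('-' marks a gap).
    alphabet: the symbols the state may emit.
    Each symbol's count is increased by `pseudocount` so that unseen
    symbols do not get probability zero.
    """
    counts = Counter(c for c in column if c != "-")
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    return {a: (counts[a] + pseudocount) / total for a in alphabet}
```

For a column "AAC-" over a toy four-letter alphabet, A is seen twice and C once, so with a pseudocount of 1 the estimates are 3/7 for A, 2/7 for C and 1/7 for each unseen symbol.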

34 Alignment of a protein to a profile HMM
To align sequence $x_1 \dots x_n$ to a profile HMM:
We will find the most likely alignment with the Viterbi DP algorithm
Define
- $V_j^M(i)$: score of best alignment of $x_1 \dots x_i$ to the HMM ending in $x_i$ being emitted from $M_j$
- $V_j^I(i)$: score of best alignment of $x_1 \dots x_i$ to the HMM ending in $x_i$ being emitted from $I_j$
- $V_j^D(i)$: score of best alignment of $x_1 \dots x_i$ to the HMM ending in $D_j$ ($x_i$ is the last character emitted before $D_j$)
Denote by $q_a$ the frequency of amino acid a in a 'random' protein

35 Alignment of a protein to a profile HMM
$$
V_j^M(i) = \log\frac{\varepsilon_{M(j)}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_{j-1}^M(i-1) + \log \alpha_{M(j-1)M(j)} \\
V_{j-1}^I(i-1) + \log \alpha_{I(j-1)M(j)} \\
V_{j-1}^D(i-1) + \log \alpha_{D(j-1)M(j)}
\end{cases}
$$
$$
V_j^I(i) = \log\frac{\varepsilon_{I(j)}(x_i)}{q_{x_i}} + \max
\begin{cases}
V_j^M(i-1) + \log \alpha_{M(j)I(j)} \\
V_j^I(i-1) + \log \alpha_{I(j)I(j)} \\
V_j^D(i-1) + \log \alpha_{D(j)I(j)}
\end{cases}
$$
$$
V_j^D(i) = \max
\begin{cases}
V_{j-1}^M(i) + \log \alpha_{M(j-1)D(j)} \\
V_{j-1}^I(i) + \log \alpha_{I(j-1)D(j)} \\
V_{j-1}^D(i) + \log \alpha_{D(j-1)D(j)}
\end{cases}
$$
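The three recurrences translate directly into a table-filling routine. The sketch below is one possible reading, not a reference implementation: the parameter layout (a `trans` dictionary keyed by `(source_kind, source_index, dest_kind, dest_index)`, BEGIN treated as `M_0`, an `"END"` destination with index m+1, missing transitions treated as probability 0) is an assumption introduced for this example, and background frequencies `q` are assumed to be strictly positive.

```python
import math

NEG_INF = float("-inf")

def slog(p):
    """log(p), with log(0) defined as -infinity."""
    return math.log(p) if p > 0 else NEG_INF

def profile_viterbi(x, m, e_M, e_I, trans, q):
    """Best log-odds score of aligning x to a profile HMM with m match states.

    e_M[j] (j = 1..m) and e_I[j] (j = 0..m) map residues to emission
    probabilities; e_M[0] is unused.
    """
    n = len(x)

    def la(src, j, dst, k):
        return slog(trans.get((src, j, dst, k), 0.0))

    VM = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VI = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VD = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    VM[0][0] = 0.0  # BEGIN plays the role of M_0 and emits nothing

    for j in range(m + 1):
        for i in range(n + 1):
            if i >= 1:
                a = x[i - 1]
                if j >= 1:  # x_i emitted from M_j
                    odds = slog(e_M[j].get(a, 0.0)) - slog(q[a])
                    VM[j][i] = odds + max(
                        VM[j - 1][i - 1] + la("M", j - 1, "M", j),
                        VI[j - 1][i - 1] + la("I", j - 1, "M", j),
                        VD[j - 1][i - 1] + la("D", j - 1, "M", j))
                # x_i emitted from I_j
                odds = slog(e_I[j].get(a, 0.0)) - slog(q[a])
                VI[j][i] = odds + max(
                    VM[j][i - 1] + la("M", j, "I", j),
                    VI[j][i - 1] + la("I", j, "I", j),
                    VD[j][i - 1] + la("D", j, "I", j))
            if j >= 1:  # D_j emits nothing, so i does not advance
                VD[j][i] = max(
                    VM[j - 1][i] + la("M", j - 1, "D", j),
                    VI[j - 1][i] + la("I", j - 1, "D", j),
                    VD[j - 1][i] + la("D", j - 1, "D", j))

    return max(VM[m][n] + la("M", m, "END", m + 1),
               VI[m][n] + la("I", m, "END", m + 1),
               VD[m][n] + la("D", m, "END", m + 1))
```

The routine returns only the score; recovering the alignment itself would need back-pointers stored alongside each table entry.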

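36 Weight of each sequence
One simple weighting scheme is to find how much edge length each leaf contributes
- Example: edge 1 belongs to a
- Example: edge 3 belongs both to a and to b: $e_3 \, e_1 / (e_1 + e_2)$ goes to a
Each edge's length is shared among the leaves below it in proportion to their current weights:
$\Delta w_i = e_{\text{current}} \, w_i \,\big/ \sum_{\text{leaves } k \text{ below } e_{\text{current}}} w_k$
[Figure: tree with nodes labeled a to i and edges labeled 1, 2, 3]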
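The weighting rule can be sketched as a bottom-up pass over the tree. The nested-tuple tree encoding and the function name are assumptions made for this illustration: a leaf is `(name, edge_length)` and an internal node is `(list_of_children, edge_length)`, where `edge_length` is the edge leading up from that node. The sketch also assumes that not all leaf edges below an internal edge have length zero.

```python
def leaf_weights(node):
    """Return {leaf_name: weight} for the subtree rooted at `node`.

    A leaf starts with the length of its own edge. Each internal edge then
    distributes its length among the leaves below it, in proportion to the
    weights those leaves have accumulated so far.
    """
    label, edge = node
    if isinstance(label, str):  # leaf: its own edge belongs to it alone
        return {label: edge}
    weights = {}
    for child in label:
        weights.update(leaf_weights(child))
    total = sum(weights.values())
    # delta w_i = e_current * w_i / (sum of w_k over leaves k below e_current)
    return {name: w + edge * w / total for name, w in weights.items()}
```

For leaves a and b with edge lengths 1 and 3 under a shared edge of length 2, a receives 2 * 1/4 = 0.5 and b receives 2 * 3/4 = 1.5, so the weights always sum to the total edge length of the tree.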

37 How to build a profile HMM

38 Resources on the web
- HMMer: free profile HMM software, http://hmmer.wustl.edu/
- SAM: another free profile HMM package, http://www.cse.ucsc.edu/research/compbio/sam.html
- PFAM: database of alignments and HMMs for protein families and domains, http://www.sanger.ac.uk/Software/Pfam/
- SCOP: a structural classification of proteins, http://scop.berkeley.edu/data/scop.b.html

