
1 Sequence Similarity

2 The Viterbi algorithm for alignment
Compute the following matrices (DP):
  M(i, j): most likely alignment of x1…xi with y1…yj ending in state M
  I(i, j): most likely alignment of x1…xi with y1…yj ending in state I
  J(i, j): most likely alignment of x1…xi with y1…yj ending in state J

M(i, j) = log Prob(xi, yj) + max{ M(i-1, j-1) + log(1 - 2δ),
                                  I(i-1, j-1) + log(1 - ε),
                                  J(i-1, j-1) + log(1 - ε) }
I(i, j) = max{ M(i-1, j) + log δ, I(i-1, j) + log ε }
J(i, j) = max{ M(i, j-1) + log δ, J(i, j-1) + log ε }

[State diagram: M emits P(xi, yj); I emits P(xi); J emits P(yj);
 transitions M→M with probability 1 - 2δ, M→I and M→J with δ each,
 I→I and J→J with ε, I→M and J→M with 1 - ε]
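A minimal Python sketch of the recursions above, run in log space. The gap-open probability delta, gap-extend probability eps, and the match/mismatch emission probabilities are illustrative assumptions, not values from the slides:

```python
import math

def viterbi_align(x, y, delta=0.1, eps=0.3, p_match=0.8, p_mismatch=0.05):
    """Log-space Viterbi score over the 3-state pair HMM (global alignment)."""
    NEG = float("-inf")
    n, m = len(x), len(y)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in match state M
    I = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in state I (gap in y)
    J = [[NEG] * (m + 1) for _ in range(n + 1)]   # ends in state J (gap in x)
    M[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                emit = math.log(p_match if x[i-1] == y[j-1] else p_mismatch)
                M[i][j] = emit + max(M[i-1][j-1] + math.log(1 - 2 * delta),
                                     I[i-1][j-1] + math.log(1 - eps),
                                     J[i-1][j-1] + math.log(1 - eps))
            if i > 0:
                I[i][j] = max(M[i-1][j] + math.log(delta),
                              I[i-1][j] + math.log(eps))
            if j > 0:
                J[i][j] = max(M[i][j-1] + math.log(delta),
                              J[i][j-1] + math.log(eps))
    return max(M[n][m], I[n][m], J[n][m])
```

A traceback of which matrix supplied each max would recover the alignment itself; only the score is returned here to keep the sketch short.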

3 One way to view the state paths – State M
[Diagram: DP grid over x1…xm and y1…yn]

4 State I
[Diagram: DP grid over x1…xm and y1…yn]

5 State J
[Diagram: DP grid over x1…xm and y1…yn]

6 Putting it all together
States I(i, j) are connected with states I and M at (i-1, j)
States J(i, j) are connected with states J and M at (i, j-1)
States M(i, j) are connected with states M, I, and J at (i-1, j-1)
[Diagram: DP grid over x1…xm and y1…yn]

7 Putting it all together
States I(i, j) are connected with states I and M at (i-1, j)
States J(i, j) are connected with states J and M at (i, j-1)
States M(i, j) are connected with states M, I, and J at (i-1, j-1)
Optimal solution is the best scoring path from the top-left to the bottom-right corner
This gives the likeliest alignment according to our HMM
[Diagram: DP grid over x1…xm and y1…yn]

8 Yet another way to represent this model
[Diagram: match states Mx1 … Mxm along Sequence X, with insert states Ix, Iy, between BEGIN and END]
We are aligning, or threading, sequence Y through sequence X
Every time yj lands in state xi, we get substitution score s(xi, yj)
Every time yj is gapped, or some xi is skipped, we pay a gap penalty

9 From this model, we can compute additional statistics
P(xi ~ yj | x, y): the probability that positions i and j align, given that sequences x and y align
P(xi ~ yj | x, y) = Σα: alignment P(α | x, y) · 1(xi ~ yj in α)
We will not cover the details, but this quantity can also be calculated with DP

10 Fast database search – BLAST (Basic Local Alignment Search Tool)
Main idea:
1. Construct a dictionary of all the words in the query
2. Initiate a local alignment for each word match between query and DB
Running Time: O(MN)
However, orders of magnitude faster than Smith-Waterman

11 BLAST – Original Version
Dictionary: all words of length k (~11 nucl.; ~4 aa)
Alignment initiated between words of alignment score ≥ T (typically T = k)
Alignment: ungapped extensions until the score drops below a statistical threshold
Output: all local alignments with score > statistical threshold
[Diagram: word list from the query scanned against the DB]
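A toy illustration of the two steps above: build a dictionary of the query's k-mers, seed on exact word hits in the database string, and extend each hit without gaps while the running score stays within an X-drop of the best seen. The +1/-1 scoring, the right-only extension, and the parameter values are simplifying assumptions:

```python
def blast_like_hits(query, db, k=3, xdrop=2):
    """Return (query_pos, db_pos, hit_length, score) for each seeded hit."""
    # Step 1: dictionary of all k-mers ("words") in the query
    words = {}
    for i in range(len(query) - k + 1):
        words.setdefault(query[i:i+k], []).append(i)
    hits = []
    # Step 2: scan the DB; each exact word match seeds an ungapped extension
    for j in range(len(db) - k + 1):
        for i in words.get(db[j:j+k], []):
            score = best = k          # +1 per matched seed position
            end = step = 0            # best extension length past the seed
            while i + k + step < len(query) and j + k + step < len(db):
                score += 1 if query[i+k+step] == db[j+k+step] else -1
                step += 1
                if score > best:
                    best, end = score, step
                elif score < best - xdrop:   # X-drop termination
                    break
            hits.append((i, j, k + end, best))
    return hits
```

Real BLAST also extends leftward, seeds on near-matching words (score ≥ T, not only exact matches), and filters hits by statistical significance; none of that is shown here.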

12 PSI-BLAST
Given a sequence query x, and database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct position-specific matrix M
   Each sequence y is given a weight so that many similar sequences cannot have much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1–4 until convergence
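A sketch of step 3 above: estimating a position-specific matrix from equal-length matches to the query. The Henikoff & Henikoff sequence weighting is replaced by uniform weights, and the pseudocount value is an assumption:

```python
from collections import Counter

def position_specific_matrix(matches, alphabet="ACGT", pseudo=1.0):
    """Per-column letter probabilities from aligned, equal-length matches."""
    length = len(matches[0])
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in matches)
        total = len(matches) + pseudo * len(alphabet)
        # pseudocounts keep unseen letters at nonzero probability
        profile.append({a: (counts[a] + pseudo) / total for a in alphabet})
    return profile
```

Searching with the matrix then means scoring each database position by these column-specific probabilities instead of a single fixed substitution matrix.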

13 BLAST Variants
BLASTN – genomic sequences
BLASTP – proteins
BLASTX – translated genome versus proteins
TBLASTN – proteins versus translated genomes
TBLASTX – translated genome versus translated genome
PSI-BLAST – iterated BLAST search
http://www.ncbi.nlm.nih.gov/BLAST

14 Multiple Sequence Alignments

15 Protein Phylogenies Proteins evolve by both duplication and species divergence


18 Definition
Given N sequences x1, x2, …, xN:
Insert gaps (-) in each sequence xi, such that
  All sequences have the same length L
  Score of the global map is maximum
A faint similarity between two sequences becomes significant if present in many
Multiple alignments can help improve the pairwise alignments

19 Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
  x: AC-GCGG-C
  y: AC-GC-GAG
  z: GCCGC-GAG
Induces:
  x: ACGCGG-C    x: AC-GCGG-C    y: AC-GCGAG
  y: ACGC-GAG    z: GCCGC-GAG    z: GCCGCGAG
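The projection in the example above, plus a sum-of-pairs score built on it, can be sketched as follows. The match/mismatch/gap values are assumed for illustration:

```python
def induced_pairwise(a, b):
    """Project two rows of a multiple alignment: drop columns gapped in both."""
    cols = [(p, q) for p, q in zip(a, b) if not (p == "-" and q == "-")]
    return "".join(p for p, _ in cols), "".join(q for _, q in cols)

def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score: sum pairwise scores over all induced alignments."""
    score = 0
    for k in range(len(rows)):
        for l in range(k + 1, len(rows)):
            for p, q in zip(*induced_pairwise(rows[k], rows[l])):
                if p == "-" or q == "-":
                    score += gap
                elif p == q:
                    score += match
                else:
                    score += mismatch
    return score
```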

20 Sum Of Pairs (cont'd)
Heuristic way to incorporate the evolution tree:
[Tree: Human, Mouse, Chicken, Duck]
Weighted SOP: S(m) = Σk<l wkl · s(mk, ml)
wkl: weight decreasing with distance

21 A Profile Representation
Given a multiple alignment M = m1…mn:
Replace each column mi with profile entry pi
  Frequency of each letter in Σ
  # gaps
  Optional: # gap openings, extensions, closings
Can think of this as a "likelihood" of each letter in each position

  - A G G C T A T C A C C T G
  T A G - C T A C C A - - - G
  C A G - C T A C C A - - - G
  C A G - C T A T C A C - G G
  C A G - C T A T C G C - G G

[Profile table: per-column frequencies of A, C, G, T, and '-' for the alignment above]

22 Multiple Sequence Alignments Algorithms

23 Multidimensional DP
Generalization of Needleman-Wunsch: S(m) = Σi S(mi)  (sum of column scores)
F(i1, i2, …, iN): optimal alignment up to (i1, …, iN)
F(i1, i2, …, iN) = max over all neighbors nbr of the cube cell: F(nbr) + S(column of the move)

24 Multidimensional DP
Example: in 3D (three sequences x, y, z): 7 neighbors/cell
F(i, j, k) = max{ F(i-1, j-1, k-1) + S(xi, yj, zk),
                  F(i-1, j-1, k)   + S(xi, yj, -),
                  F(i-1, j, k-1)   + S(xi, -, zk),
                  F(i-1, j, k)     + S(xi, -, -),
                  F(i, j-1, k-1)   + S(-, yj, zk),
                  F(i, j-1, k)     + S(-, yj, -),
                  F(i, j, k-1)     + S(-, -, zk) }

25 Multidimensional DP
Running Time:
1. Size of matrix: L^N, where L = length of each sequence, N = number of sequences
2. Neighbors/cell: 2^N - 1
Therefore: O(2^N L^N)

26 Multidimensional DP
Running Time:
1. Size of matrix: L^N, where L = length of each sequence, N = number of sequences
2. Neighbors/cell: 2^N - 1
Therefore: O(2^N L^N)
How do gap states generalize? VERY badly!
  Require 2^N states, one per combination of gapped/ungapped sequences
  Running time: O(2^N · 2^N · L^N) = O(4^N L^N)
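The 3-sequence recurrence above can be sketched directly, iterating over all 2^3 - 1 = 7 predecessor moves per cell. The column score here (+1 per matching residue pair, -1 per mismatching or gapped pair) is an assumed sum-of-pairs scheme, not one from the slides:

```python
from itertools import product

def column_score(a, b, c):
    """Sum-of-pairs score of one 3-row column (assumed +1/-1/-1 scheme)."""
    s = 0
    for p, q in ((a, b), (a, c), (b, c)):
        s += 1 if (p == q and p != "-") else -1
    return s

def align3(x, y, z):
    """Optimal 3-way alignment score via the 7-neighbor DP."""
    NEG = float("-inf")
    F = {(0, 0, 0): 0}
    for i, j, k in product(range(len(x)+1), range(len(y)+1), range(len(z)+1)):
        if (i, j, k) == (0, 0, 0):
            continue
        best = NEG
        for di, dj, dk in product((0, 1), repeat=3):   # the 7 moves
            if (di, dj, dk) == (0, 0, 0) or di > i or dj > j or dk > k:
                continue
            prev = F.get((i - di, j - dj, k - dk), NEG)
            col = column_score(x[i-1] if di else "-",
                               y[j-1] if dj else "-",
                               z[k-1] if dk else "-")
            best = max(best, prev + col)
        F[i, j, k] = best
    return F[len(x), len(y), len(z)]
```

The L^3 cells times 7 moves per cell make the O(2^N L^N) cost concrete for N = 3.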

27 Progressive Alignment
When the evolutionary tree is known:
  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Weighted version:
  Tree edges have weights, proportional to the divergence in that edge
  New profile is a weighted average of the two old profiles
[Tree: leaves x, y, z, w; internal profiles pxy, pzw, pxyzw]

28 Progressive Alignment
When the evolutionary tree is known:
  Align closest first, in the order of the tree
  In each step, align two sequences x, y, or profiles px, py, to generate a new alignment with associated profile presult
Example profile over (A, C, G, T, -):
  px = (0.8, 0.2, 0, 0, 0)
  py = (0.6, 0, 0, 0, 0.4)
s(px, py) = 0.8·0.6·s(A, A) + 0.2·0.6·s(C, A) + 0.8·0.4·s(A, -) + 0.2·0.4·s(C, -)
Result: pxy = (0.7, 0.1, 0, 0, 0.2)
s(px, -) = 0.8·1.0·s(A, -) + 0.2·1.0·s(C, -)
Result: px- = (0.4, 0.1, 0, 0, 0.5)
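The two profile operations in the example above can be sketched as follows. The unweighted average (the slide's weighted version would mix by edge weights) and the substitution scores passed in are assumptions:

```python
ALPHABET = "ACGT-"

def profile_column_score(px, py, s):
    """Expected substitution score between two profile columns px, py."""
    return sum(fa * fb * s(ALPHABET[i], ALPHABET[j])
               for i, fa in enumerate(px) if fa
               for j, fb in enumerate(py) if fb)

def merge_columns(px, py):
    """Merged column: plain average of the two frequency vectors."""
    return tuple((a + b) / 2 for a, b in zip(px, py))
```

Aligning a profile column against a gap is the same computation with py set to the pure-gap column (0, 0, 0, 0, 1), which reproduces the px- result on the slide.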

29 Progressive Alignment
When the evolutionary tree is unknown:
  Perform all pairwise alignments
  Define distance matrix D, where D(x, y) is a measure of evolutionary distance, based on the pairwise alignment
  Construct a tree
  Align on the tree

30 Heuristics to improve alignments
  Iterative refinement schemes
  A*-based search
  Consistency
  Simulated annealing
  …

31 Iterative Refinement
One problem of progressive alignment: initial alignments are "frozen" even when new evidence comes
Example:
  x: GAAGTT
  y: GAC-TT   ← Frozen!
  z: GAACTG
  w: GTACTG
Now it is clear that the correct alignment is y = GA-CTT

32 Iterative Refinement
Algorithm (Barton-Sternberg):
1. For j = 1 to N: remove xj, and realign it to the profile of x1…xj-1 xj+1…xN
2. Repeat until convergence
[Diagram: sequences x, y, z; x, z fixed; allow y to vary against their projection]

33 Iterative Refinement
Example: align (x, y), (z, w), (xy, zw):
  x: GAAGTTA
  y: GAC-TTA
  z: GAACTGA
  w: GTACTGA
After realigning y:
  x: GAAGTTA
  y: G-ACTTA   (+3 matches)
  z: GAACTGA
  w: GTACTGA
Variant: refinement on a tree ("tree partitioning")


35 Iterative Refinement
Example not handled well:
  x:  GAAGTTA
  y1: GAC-TTA
  y2: GAC-TTA
  y3: GAC-TTA
  z:  GAACTGA
  w:  GTACTGA
Realigning any single yi changes nothing

36 Some Resources
BLAST & PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST
CLUSTALW (most widely used): http://www.ebi.ac.uk/clustalw/
MUSCLE (most scalable): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
PROBCONS (most accurate): http://probcons.stanford.edu/

37 MUSCLE at a glance
1. Fast measurement of all pairwise distances between sequences: D_draft(x, y) defined in terms of # common k-mers (k ~ 3); O(N² L log L) time
2. Build tree T_draft based on D_draft, with a hierarchical clustering method (UPGMA)
3. Progressive alignment over T_draft, resulting in multiple alignment M_draft
4. Measure distances D(x, y) based on M_draft
5. Build tree T based on D
6. Progressive alignment over T, to build M
7. Iterative refinement; for many rounds, do:
   Tree partitioning: split M on one branch and realign the two resulting profiles
   If the new alignment M' has a better sum-of-pairs score than the previous one, accept it
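Step 1 above can be sketched as a common-k-mer distance between two sequences. MUSCLE's actual normalization is more involved; this fraction-of-shared-k-mers form is a stand-in:

```python
def kmer_distance(x, y, k=3):
    """Draft distance: 1 minus the fraction of shared k-mers (assumed form)."""
    kx = {x[i:i+k] for i in range(len(x) - k + 1)}
    ky = {y[i:i+k] for i in range(len(y) - k + 1)}
    shared = len(kx & ky)
    return 1.0 - shared / max(1, min(len(kx), len(ky)))
```

Because it needs no alignment at all, computing this for all pairs is what makes the draft distance matrix fast enough to seed the first tree.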

38 PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins
[Diagram: pair HMM with states MATCH (emits xi, yj), INSERT X (emits xi, -), INSERT Y (emits -, yj)]

39 A pair-HMM model of pairwise alignment
  Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
  Transition probabilities ~ gap penalties
  Emission probabilities ~ substitution matrix
Example:
  x: ABRACA-DABRA
  y: AB-ACARDI---
[Diagram: pair HMM states MATCH, INSERT X, INSERT Y]

40 Computing Pairwise Alignments
The Viterbi algorithm:
  identifies the highest probability alignment, αviterbi, in O(L²) time
  the conditional distribution P(α | x, y) reflects the model's uncertainty over the "correct" alignment of x and y
Caveat: the most likely alignment is not the most accurate
  Alternative: find the alignment of maximum expected accuracy
[Diagram: distribution P(α | x, y) with its mode αviterbi]

41 The Lazy-Teacher Analogy
10 students take a 10-question true-false quiz
How do you make the answer key?
  Approach #1: use the answer sheet of the best student!
  Approach #2: weighted majority vote!
[Figure: graded answer sheets (A-, A, B+, B-, C, …) with per-question answers such as "4. F" / "4. T"]

42 Viterbi vs. Maximum Expected Accuracy (MEA)
Viterbi:
  picks the single alignment with the highest chance of being completely correct
  mathematically, finds the alignment α that maximizes Eα*[ 1{α = α*} ]
Maximum Expected Accuracy:
  picks the alignment with the highest expected number of correct predictions
  mathematically, finds the alignment α that maximizes Eα*[ accuracy(α, α*) ]
[Figure: graded answer sheets, as in the lazy-teacher analogy]
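The distinction can be made concrete with the quiz analogy, using answer keys in place of alignments: given a distribution over candidate true/false keys, Viterbi takes the single most probable key, while MEA answers each question by weighted majority vote. The candidate keys and probabilities below are made up for illustration:

```python
def viterbi_pick(candidates):
    """candidates: list of (answer_string, probability). Take the mode."""
    return max(candidates, key=lambda c: c[1])[0]

def mea_pick(candidates):
    """Maximize expected per-question accuracy: weighted majority per question."""
    n = len(candidates[0][0])
    key = []
    for q in range(n):
        p_true = sum(p for ans, p in candidates if ans[q] == "T")
        key.append("T" if p_true >= 0.5 else "F")
    return "".join(key)
```

With candidates [("TTT", 0.4), ("TFF", 0.3), ("FTF", 0.3)], Viterbi returns "TTT" even though question 3 is more likely F (probability 0.6); MEA returns "TTF". The same trade-off holds when the "questions" are the individual residue pairings of an alignment.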

