1 CAP5510 – Bioinformatics Sequence Comparison Tamer Kahveci CISE Department University of Florida.

2 Goals Understand major sequence comparison algorithms. Gain hands-on experience.

3 Why Compare Sequences ? Prediction of function Construction of phylogeny Shotgun assembly Finding motifs Understanding of biological processes

4 Question
Q = AATTCGA
X = ACATCGG
Y = CATTCGCC
Z = ATTCCGC
Form groups of 2-3. Sort X, Y, and Z in decreasing similarity to Q. (5 min)

5 Dot Plot [Figure: dot plot with Q = AATTCGA across the top and X = ACATCGG down the side] How can we compute similarity? O(m+n) time. Is it a good scheme? Use longer subsequences (k-grams).

6 Dot Plot [Figure: dot plot with Q = AATTCGA across the top and X = ACATCGG down the side] Use longer subsequences (k-grams).
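A k-gram dot plot can be sketched in a few lines. The names `dot_plot` and `render` are illustrative and do not come from the slides; a dot is placed at (i, j) when the k letters of Q starting at i equal the k letters of X starting at j. This brute-force version compares every pair of positions.

```python
def dot_plot(q, x, k=1):
    """Return the set of cells (i, j) where q[i:i+k] equals x[j:j+k]."""
    return {
        (i, j)
        for i in range(len(q) - k + 1)
        for j in range(len(x) - k + 1)
        if q[i:i + k] == x[j:j + k]
    }


def render(q, x, k=1):
    """Draw the plot as text: q across the top, x down the side."""
    dots = dot_plot(q, x, k)
    lines = ["  " + " ".join(q)]
    for j, letter in enumerate(x):
        row = ["*" if (i, j) in dots else "." for i in range(len(q))]
        lines.append(letter + " " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    print(render("AATTCGA", "ACATCGG", k=1))
    # With 3-grams, only the shared TCG survives as a dot.
    print(sorted(dot_plot("AATTCGA", "ACATCGG", k=3)))
```

Raising k removes most of the isolated dots, which is the point of the k-gram suggestion on the slide.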

7 Sequence Comparison How to align –Global alignment: align entire sequences –Local alignment: align subsequences How to evaluate –Distance –Score

8 Global Alignment
Q = AATTCGA
    |rr|||r
X = ACATCGG
4 matches, 3 mismatches
Q = A-ATTCGA
    |i|d|||r
X = ACA-TCGG
5 matches, 1 insert, 1 delete, 1 mismatch
Similarity is defined in terms of the distance or score of the alignment. Many combinations of insert / delete / (mis)match are possible.

9 Each Alignment Maps to a Path [Figure: an alignment drawn as a path through the grid of AATTCG (columns) against ACATCG (rows)]

10 Edit Distance Minimum number of insert / delete / replace operations needed to transform one sequence into the other.
Q = AATTCGA
    |  |||    => 3
X = ACATCGG
How do we find the minimum edit distance?

11 Global sequence alignment (Needleman-Wunsch) Compute distance recursively : dynamic programming. Case 1 : match (0) or mismatch (1) Case 2 : delete (1) Case 3 : insert (1) Case 0 : one string is empty (n)

12 Global sequence alignment (Needleman-Wunsch): optimal string alignment
D(i,j) = edit distance between A(1:i) and B(1:j)
d(a,b) = 0 if a = b, 1 otherwise.
Recurrence relation
–D(i,0) = Σ d(A(k),-), 1 <= k <= i
–D(0,j) = Σ d(-,B(k)), 1 <= k <= j
–D(i,j) = min { D(i-1,j) + d(A(i),-), D(i,j-1) + d(-,B(j)), D(i-1,j-1) + d(A(i),B(j)) }

13 DP Example [Figure: empty DP grid for AATTCG (columns) against ACATCG (rows)]
D(i,0) = Σ d(A(k),-), 1 <= k <= i
D(0,j) = Σ d(-,B(k)), 1 <= k <= j
D(i,j) = min { D(i-1,j) + d(A(i),-), D(i,j-1) + d(-,B(j)), D(i-1,j-1) + d(A(i),B(j)) }

14 DP Example: Backtracking
      A  A  T  T  C  G
   0  1  2  3  4  5  6
A  1  0  1  2  3  4  5
C  2  1  1  2  3  3  4
A  3  2  1  2  3  4  4
T  4  3  2  1  2  3  4
C  5  4  3  2  2  2  3
G  6  5  4  3  3  3  2
O(mn) time and space. Reconstruct the alignment by backtracking. O(max{m,n}) space if the alignment is not needed. How?
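The table and the backtracking step can be reproduced with a short sketch. It assumes unit costs for insert, delete and replace, as in the recurrence; the function names `edit_distance_table` and `traceback` are illustrative only.

```python
def edit_distance_table(a, b):
    """Fill D, where D[i][j] is the edit distance between a[:i] and b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i                      # delete all of a[:i]
    for j in range(1, n + 1):
        D[0][j] = j                      # insert all of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + 1,                           # delete a[i-1]
                D[i][j - 1] + 1,                           # insert b[j-1]
                D[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match / replace
            )
    return D


def traceback(a, b, D):
    """Walk back from D[m][n] to D[0][0] and return one optimal alignment."""
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    D = edit_distance_table("ACATCG", "AATTCG")   # rows ACATCG, columns AATTCG
    for row in D:
        print(" ".join(str(v) for v in row))
    print(*traceback("ACATCG", "AATTCG", D), sep="\n")
```

When several predecessors tie, this traceback prefers the diagonal move; other tie-breaking rules give other alignments of the same cost.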

15 Number of Alignments N(n, m) = number of alignments of sequences of n and m letters (not necessarily optimal). N(0, i) = N(i, 0) = 1. N(n, m) = N(n-1, m) + N(n, m-1) + N(n-1, m-1). $N(n, n) \sim (1 + 2^{1/2})^{2n+1} n^{-1/2}$. $N(1000, 1000) > 10^{767}$, compared with $10^{80}$ atoms in the universe!
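The recurrence for N(n, m) is easy to evaluate directly, which shows how quickly the count grows. This is a small sketch; `count_alignments` is an illustrative name, and Python's arbitrary-precision integers are relied on for large inputs.

```python
def count_alignments(n, m):
    """Evaluate N(n, m) = N(n-1, m) + N(n, m-1) + N(n-1, m-1), N(0, i) = N(i, 0) = 1."""
    prev = [1] * (m + 1)                 # row N(0, 0..m)
    for _ in range(n):
        cur = [1]                        # N(i, 0) = 1
        for j in range(1, m + 1):
            cur.append(prev[j] + cur[j - 1] + prev[j - 1])
        prev = cur
    return prev[m]


if __name__ == "__main__":
    for n in (1, 2, 3, 10):
        print(n, count_alignments(n, n))
```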

16 Edit Distance: a Good Measure? Compare these two alignments. Which one is better?
Q = AATTCGA
    |  |||
X = ACATCGG
Q = A-ATTCGA
    | | |||
X = ACA-TCGG
Scoring scheme: +1 for each match, -1 for each mismatch/indel. Can be computed the same way as edit distance by including +1 for each match.

17 More Trouble: Scoring Matrices Different mutations may occur at different rates in nature. Why ? E.g., each amino acid = three nucleotides. Transformation of one amino acid to other due to single nucleotide modification may be biased –E = GAA, GAG –D = GAU, GAC –F = UUU, UUC –E similar to D, not similar to F Mutation probability of different pairs of nucleotides may differ. PAM, BLOSUM matrices

18 The BLOSUM45 Matrix
    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   5  -2  -1  -2  -1  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -2  -2   0
R  -2   7   0  -1  -3   1   0  -2   0  -3  -2   3  -1  -2  -2  -1  -1  -2  -1  -2
N  -1   0   6   2  -2   0   0   0   1  -2  -3   0  -2  -2  -2   1   0  -4  -2  -3
D  -2  -1   2   7  -3   0   2  -1   0  -4  -3   0  -3  -4  -1   0  -1  -4  -2  -3
C  -1  -3  -2  -3  12  -3  -3  -3  -3  -3  -2  -3  -2  -2  -4  -1  -1  -5  -3  -1
Q  -1   1   0   0  -3   6   2  -2   1  -2  -2   1   0  -4  -1   0  -1  -2  -1  -3
E  -1   0   0   2  -3   2   6  -2   0  -3  -2   1  -2  -3   0   0  -1  -3  -2  -3
G   0  -2   0  -1  -3  -2  -2   7  -2  -4  -3  -2  -2  -3  -2   0  -2  -2  -3  -3
H  -2   0   1   0  -3   1   0  -2  10  -3  -2  -1   0  -2  -2  -1  -2  -3   2  -3
I  -1  -3  -2  -4  -3  -2  -3  -4  -3   5   2  -3   2   0  -2  -2  -1  -2   0   3
L  -1  -2  -3  -3  -2  -2  -2  -3  -2   2   5  -3   2   1  -3  -3  -1  -2   0   1
K  -1   3   0   0  -3   1   1  -2  -1  -3  -3   5  -1  -3  -1  -1  -1  -2  -1  -2
M  -1  -1  -2  -3  -2   0  -2  -2   0   2   2  -1   6   0  -2  -2  -1  -2   0   1
F  -2  -2  -2  -4  -2  -4  -3  -3  -2   0   1  -3   0   8  -3  -2  -1   1   3   0
P  -1  -2  -2  -1  -4  -1   0  -2  -2  -2  -3  -1  -2  -3   9  -1  -1  -3  -3  -3
S   1  -1   1   0  -1   0   0   0  -1  -2  -3  -1  -2  -2  -1   4   2  -4  -2  -1
T   0  -1   0  -1  -1  -1  -1  -2  -2  -1  -1  -1  -1  -1  -1   2   5  -3  -1   0
W  -2  -2  -4  -4  -5  -2  -3  -2  -3  -2  -2  -2  -2   1  -3  -4  -3  15   3  -3
Y  -2  -1  -2  -2  -3  -1  -2  -3   2   0   0  -1   0   3  -3  -2  -1   3   8  -1
V   0  -2  -3  -3  -1  -3  -3  -3  -3   3   1  -2   1   0  -3  -1   0  -3  -1   5

19 score(H,P) = -2, gap penalty = -8
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2
A  -16
W  -24
H  -32
E  -40
A  -48
E  -56

20 score(E,P) = 0, score(E,A) = -1, score(H,A) = -2
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -8
A  -16  -10   -3
W  -24
H  -32
E  -40
A  -48
E  -56

21 [Blank cells: values missing from the transcript]
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -8  -16  -24  -33  -42  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -19  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -4  -12  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -12   -6   -2  -11  -19
E  -40  -22   -8  -16        -9  -12  -14   -6    4   -5
A  -48  -30  -16   -3  -11       -12       -14   -4    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -8    2
Optimal alignment:
H E A G A W G H E - E
- P - - A W - H E A E

22 Distance vs. Similarity Similarity model: s(a,b), g'(k). Distance model: d(a,b), g(k). If there is a constant c such that –s(a,b) = c – d(a,b) –g'(k) = g(k) – kc/2 then the similarity-optimal alignment = the distance-optimal alignment.

23 Global Alignment?
Q = A-ATTCGA
    | | |||
X = ACA-TCGG
Q = AATTCGA-
     |||||
Y = CATTCGCC
Which one is more similar to Q? Local alignment: highest-scoring subsequence alignment. How can we find it? Brute force: $O(n^3 m^3)$. Gotoh (Smith-Waterman): O(nm).

24 Local Suffix Alignment V(i, 0) = V(0, j) = 0. V(i,j) = max{0, V(i-1, j-1) + s(x(i), y(j)), V(i-1, j) + s(x(i), -), V(i, j-1) + s(-, y(j))} [Figure: prefixes X[1:i] and Y[1:j]]

25 Local Alignment The prefixes with highest local suffix alignment

26 Local Alignment Example
Match = +5, Mismatch = -4
P = GCAGAGCACG (rows), Q = GCTGGAAGGCAT (columns)
[Figure: local-alignment DP matrix of P against Q with the optimal path highlighted]
P's subsequence: G C A G A G C A
Q's subsequence: G A A G – G C A
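The local-alignment recurrence with its traceback fits in a short sketch. Match = +5 and mismatch = -4 come from the example; the linear gap score of -4 is an assumption (it reproduces the first rows of the example matrix but is not stated on the slide), and `smith_waterman` is an illustrative name.

```python
def smith_waterman(x, y, match=5, mismatch=-4, gap=-4):
    """Return (best score, aligned part of x, aligned part of y, table V)."""
    V = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    best, best_cell = 0, (0, 0)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                          V[i - 1][j] + gap,     # x[i-1] against a gap
                          V[i][j - 1] + gap)     # y[j-1] against a gap
            if V[i][j] > best:
                best, best_cell = V[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_cell
    ax, ay = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), V


if __name__ == "__main__":
    score, ax, ay, _ = smith_waterman("GCAGAGCACG", "GCTGGAAGGCAT")
    print(score)
    print(ax)
    print(ay)
```

The zero in the max is what lets an alignment start anywhere; taking the maximum over the whole table lets it end anywhere.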

27 Goals Other important sequence comparison problems –banded alignment –end free search –pattern search –non-overlapping alignments –gaps –linear-space algorithms –bitwise operations –neighborhood searching –NFAs –Approximate alignment

28 Banded Global Alignment Two sequences differ by at most w edit operations (w<<n). How can we align ?

29 Banded Alignment Example O(wn) time and space. Example: –w=3. –Match = +1 –Mismatch = -1 –Indel = -2
[Figure: banded DP matrix for ACCACACA (columns) against ACACCATA (rows); only cells within w = 3 of the main diagonal are filled]
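A banded global alignment only fills cells with |i - j| <= w. The sketch below stores the band in a dictionary, so it uses O(wn) entries; the scores are the ones from the example, while `banded_score` and the dictionary layout are illustrative choices rather than the lecture's implementation.

```python
NEG_INF = float("-inf")


def banded_score(x, y, w, match=1, mismatch=-1, indel=-2):
    """Best global alignment score using only cells with |i - j| <= w."""
    n, m = len(x), len(y)
    if abs(n - m) > w:
        raise ValueError("band too narrow to reach the final cell")
    V = {(0, 0): 0}
    for i in range(n + 1):
        for j in range(max(0, i - w), min(m, i + w) + 1):
            if i == 0 and j == 0:
                continue
            best = NEG_INF
            if (i - 1, j - 1) in V:          # diagonal: match or mismatch
                s = match if x[i - 1] == y[j - 1] else mismatch
                best = max(best, V[(i - 1, j - 1)] + s)
            if (i - 1, j) in V:              # x[i-1] against a gap
                best = max(best, V[(i - 1, j)] + indel)
            if (i, j - 1) in V:              # y[j-1] against a gap
                best = max(best, V[(i, j - 1)] + indel)
            V[(i, j)] = best
    return V[(n, m)]


if __name__ == "__main__":
    print(banded_score("ACACCATA", "ACCACACA", w=3))
```

Cells outside the band are simply absent from the dictionary, so a neighbour that falls outside the band is never consulted.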

30 End space free alignment
--CCA-TGAC
TTCCAGTG--
How can we find it?

31 End space free alignment
--CCA-TGAC
TTCCAGTG--

32 Pattern search
Text:    AAGCAGCCATGACGGAAAT
Pattern: CCAGTG
How can we find it?

33 Pattern search
Text:    AAGCAGCCATGACGGAAAT
Pattern: CCAGTG
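One standard way to answer the question is to reuse the edit-distance DP but let an occurrence start anywhere in the text at no cost (the row for the empty pattern is all zeros). This is a sketch under unit edit costs; `approximate_matches` is an illustrative name, and only one column is kept at a time.

```python
def approximate_matches(text, pattern, k):
    """Return 0-based end positions j such that some substring of text
    ending at j is within edit distance k of pattern."""
    m = len(pattern)
    prev = list(range(m + 1))        # column for the empty text prefix
    hits = []
    for j, t in enumerate(text):
        cur = [0]                    # free start: empty pattern costs 0 anywhere
        for i in range(1, m + 1):
            cur.append(min(prev[i - 1] + (pattern[i - 1] != t),  # match / replace
                           prev[i] + 1,                          # extra text letter
                           cur[i - 1] + 1))                      # skipped pattern letter
        if cur[m] <= k:
            hits.append(j)
        prev = cur
    return hits


if __name__ == "__main__":
    print(approximate_matches("AAGCAGCCATGACGGAAAT", "CCAGTG", 2))
```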

34 Non-overlapping Local Alignments
GCTCTGCGAATA
CGTTGAGATACT
Find all non-overlapping local alignments with score > threshold. Two alignments overlap if they share the same letter pair. How do we find them?

35 Non-overlapping Local Alignments
1. Compute the DP matrix
2. Find the largest scoring alignment > threshold
3. Report the alignment
4. Remove the effects of the alignment from the matrix
5. Go to step 2

36 Next: Closer look into gaps

37 Gaps
Q = AATTCGAG
     |||||
Y = -ATTCGC-
Q = AATTCGAG
    |||||
Z = AATTCC--
Which one is more similar to Q? Starting an indel is less likely than continuing an indel. Affine gap model: large gap-open penalty and smaller gap-extend penalty. How can we compute it?

38 Computing affine gaps 3 cases: E, F, G [Figure: the three cases E, F and G for how an alignment ends at cell (i, j)]

39 Recursions E(i, 0) = gap_open + i x gap_extend. E(i,j) = max{E(i, j-1) + gap_extend, V(i, j-1) + gap_open + gap_extend} [Figure: case E at cell (i, j)]

40 Recursions F(0, j) = gap_open + j x gap_extend. F(i,j) = max{F(i-1, j) + gap_extend, V(i-1, j) + gap_open + gap_extend} [Figure: case F at cell (i, j)]

41 Recursions G(i,j) = V(i-1, j-1) + s(x(i), y(j)) [Figure: case G at cell (i, j)]

42 Recursions V(i, 0) = gap_open + i x gap_extend V(0, j) = gap_open + j x gap_extend V(i, j) = max{E(i, j), F(i, j), G(i, j)}
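The E, F, G and V recursions can be combined into one score computation. The sketch keeps the boundary conditions given on the slides and takes the match term as V(i-1, j-1) + s(x(i), y(j)), the usual form of this recurrence. The default scores and the name `affine_global_score` are assumptions for illustration; gap_open and gap_extend are negative because the recursions maximize.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Global alignment score where a gap of length L costs gap_open + L * gap_extend."""
    n, m = len(x), len(y)
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with y[j] against a gap
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # ends with x[i] against a gap
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = E[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):
        V[0][j] = F[0][j] = gap_open + j * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            E[i][j] = max(E[i][j - 1] + gap_extend,              # extend the gap
                          V[i][j - 1] + gap_open + gap_extend)   # open a new gap
            F[i][j] = max(F[i - 1][j] + gap_extend,
                          V[i - 1][j] + gap_open + gap_extend)
            G = V[i - 1][j - 1] + s                              # x[i] aligned with y[j]
            V[i][j] = max(E[i][j], F[i][j], G)
    return V[n][m]


if __name__ == "__main__":
    print(affine_global_score("AATTCGAG", "ATTCGC"))
    print(affine_global_score("AATTCGAG", "AATTCC"))
```

Only the previous row of each table is needed for the score, so the three full tables are kept here purely for readability.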

43 Other Gap Models Constant: fixed gap penalty per gap regardless of length. Non-linear: gap cost increase is non-linear. –E.g., g(n) = -(1 + 1/2 + 1/3 + … + 1/n) Arbitrary

44 DP in Linear Space ?

45 Linear Space DP Keep two vectors at a time: –Two columns or two rows. O(min{m,n}) space. O(mn) time. No backtracking. [Figure: DP grid for AATTCG (columns) against ACATCG (rows) with two active vectors]
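The two-vector idea can be sketched as follows for unit-cost edit distance. The function name `edit_distance_linear` is illustrative; the strings are swapped if needed so that the stored rows have length min{m,n} + 1.

```python
def edit_distance_linear(a, b):
    """Edit distance using two rows of length min(len(a), len(b)) + 1."""
    if len(b) > len(a):
        a, b = b, a                      # keep the shorter string along the row
    prev = list(range(len(b) + 1))       # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]                        # D(i, 0) = i
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match / replace
        prev = cur                       # the older row is no longer needed
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance_linear("AATTCGA", "ACATCGG"))
```

Because the older row is discarded, the distance is available but the path is not, which is why backtracking needs the divide-and-conquer scheme.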

46 Linear Space DP with Backtracking Find midpoint of the alignment –Align the first half –Align the second half –Choose the point with best sum of score/distance Search the upper left and lower right of mid point

47 Linear Space DP with Backtracking: Time Complexity 2(n/2 x m) = nm; 2(n/4 x k) + 2(n/4 x (m-k)) = nm/2; …; $nm/2^i$. Adds up to 2nm.
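The total is a geometric series. Written out, assuming recursion level i (with i = 0 for the first split) touches at most nm/2^i cells:

```latex
\underbrace{nm}_{\text{level } 0}
  + \underbrace{\frac{nm}{2}}_{\text{level } 1}
  + \underbrace{\frac{nm}{4}}_{\text{level } 2}
  + \dots
  = nm \sum_{i \ge 0} \frac{1}{2^{i}}
  \le 2nm
```

So recovering the alignment in linear space costs at most about twice the time of the plain DP.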

48 Next: inversions

49 Alignment with Inversions A' = T and G' = C
ACTCTCTCGCTGTACTG
AATCT-ACTACTGCTTG
Each letter is inverted only once. An inversion cost (inv) is charged for each inverted block. How to find the alignment?

50 Alignment with Inversions
For i = 1:m
  For j = 1:n
    For g = 1:i
      For h = 1:j
        Compute Z(g,h; i,j)
    V(i,j) = max{
      max over g,h {V(g-1, h-1) + Z(g,h; i,j)} + inv,
      V(i-1, j-1) + s(x_i, y_j),
      V(i-1, j) + ins,
      V(i, j-1) + del }
$O(n^6)$ time

51 Alignment with Inversions: Faster Method
1. Find all local alignments of x and y' (Z)
2. V(i,j) = max{
   max{V(g-1, h-1) + Z(g, h; i, j)} + inv,
   V(i-1, j-1) + s(x_i, y_j),
   V(i-1, j) + ins,
   V(i, j-1) + del }
O(nmL) time, where L is the average number of inverse alignments ending at (i,j)

52 Recap & Goals Other important sequence comparison problems –banded alignment –end free search –pattern search –non-overlapping alignments –gaps –linear-space algorithms –inversions –bitwise operations –neighborhood searching –NFAs –Approximate alignment –Homology

53 Pattern Searching with Bitwise Operations UM-92 (A3)

54 Pattern Searching with Bitwise Operations (1) Simple case: find all exact matches to y in x.
R_j[i] = 1 if the first i letters of y match the last i letters of x[1:j].
R_0[i] = 1 if i = 0; 0 if 0 < i <= m.
R_{j+1}[i] = 1 if R_j[i-1] = 1 and y[i] = x[j+1]; 0 otherwise.
Match if R_j[m] = 1.
[Figure: text x of length n and pattern y of length m]

55 Pattern Searching with Bitwise Operations (2) S_k[i] = –1 if y[i] = kth letter in the alphabet –0 else –(for i = 1, 2, …, m) R_{j+1} = (right shift of R_j) AND (S_k) –where x[j+1] = kth letter in the alphabet

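56 Pattern Searching with Bitwise Operations (3)
x = AATAACAATACAT, y = AATAC
S (rows: letters of y; columns: alphabet)
    A C G T
A   1 0 0 0
A   1 0 0 0
T   0 0 0 1
A   1 0 0 0
C   0 1 0 0
R_1 = 10000 (after reading the first A of x)
Right shift (RS) of R_1 = 11000; S for letter A = 11010; RS AND S = 11000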

57 Pattern Searching with Bitwise Operations (3)
x = AATAACAATACAT, y = AATAC
    R_1 R_2
A    1   1
A    0   1
T    0   0
A    0   0
C    0   0
S (rows: letters of y; columns: alphabet)
    A C G T
A   1 0 0 0
A   1 0 0 0
T   0 0 0 1
A   1 0 0 0
C   0 1 0 0
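The exact-match case maps directly onto integer bit operations. In this sketch bit i-1 of the integer `R` plays the role of R[i], so the "right shift" of the vector drawn top to bottom on the slides becomes a left shift of the integer, with a 1 shifted in for R[0]. The name `shift_and` and the dictionary of masks are illustrative choices.

```python
def shift_and(text, pattern):
    """Return 0-based start positions of exact occurrences of pattern in text."""
    m = len(pattern)
    mask = {}                                   # mask[c] has bit i set iff pattern[i] == c
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    R = 0
    hits = []
    for j, ch in enumerate(text):
        # Shift in a 1 (the empty prefix always matches), then keep only the
        # prefixes whose next pattern letter equals the current text letter.
        R = ((R << 1) | 1) & mask.get(ch, 0)
        if R & (1 << (m - 1)):                  # the whole pattern is matched
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and("AATAACAATACAT", "AATAC"))
```

Each text letter costs one shift, one OR and one AND as long as the pattern fits in a machine word; Python integers remove that limit at some cost in speed.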

58 Pattern Searching with Bitwise Operations (4) Harder case: one edit operation allowed. Use R and R^1. R is for the exact match. R^1_j[i] = 1 if the first i letters of y match the last i letters of x[1:j] with at most one edit operation.

59 Pattern Searching with Bitwise Operations (5) Insertion
1. y[1:i] matches x[:j] exactly: insert x[j+1]
2. y[1:i-1] matches x[:j] with one insertion: match y[i] with x[j+1] if they are equal
R^1_{j+1} = (R_j) OR ((right shift of R^1_j) AND (S_k)) –where x[j+1] = kth letter in the alphabet
Similar reasoning for delete and replace

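60 Pattern Searching with Bitwise Operations (6)
x = AATAACAATACAT; columns show the vectors after reading letters j = 1..4 of x
R
    1 2 3 4
A   1 1 0 1
A   0 1 0 0
T   0 0 1 0
A   0 0 0 1
C   0 0 0 0
R^1
    1 2 3 4
A   1 1 1 1
A   0 1 1 1
T   0 0 1 1
A   0 0 0 1
C   0 0 0 0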

61 Pattern Searching with Bitwise Operations (7) General problem: k edit operations are allowed. Use R^1, R^2, …, R^k. Update R^{z+1} using R^z and R^{z+1}. Improve running time by partitioning y into k+1 pieces (next slide).
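The k-error update can be sketched with one register per error level, combining the match, insertion, replacement and deletion cases. This follows the common Shift-And extension rather than any code from the lecture; `bitap_k_errors` is an illustrative name, and hits are reported by end position because an approximate occurrence has no unique start.

```python
def bitap_k_errors(text, pattern, k):
    """Return 0-based end positions where pattern matches with at most k edits."""
    m = len(pattern)
    full = (1 << m) - 1
    mask = {}
    for i, ch in enumerate(pattern):
        mask[ch] = mask.get(ch, 0) | (1 << i)
    # R[d] has bit i set iff pattern[:i+1] matches a suffix of the text read
    # so far with at most d edits; initially only prefixes of length <= d do.
    R = [(1 << d) - 1 for d in range(k + 1)]
    hits = []
    for j, ch in enumerate(text):
        S = mask.get(ch, 0)
        old = R[:]
        R[0] = ((old[0] << 1) | 1) & S
        for d in range(1, k + 1):
            R[d] = ((((old[d] << 1) | 1) & S)     # letters match
                    | old[d - 1]                  # extra letter in the text
                    | ((old[d - 1] << 1) | 1)     # replacement
                    | ((R[d - 1] << 1) | 1)       # pattern letter skipped
                    ) & full
        if R[k] & (1 << (m - 1)):
            hits.append(j)
    return hits


if __name__ == "__main__":
    print(bitap_k_errors("AATAACAATACAT", "AATAC", 1))
```

Each level d only reads level d-1, which is the dependency the slide describes as updating R^{z+1} from R^z and R^{z+1}.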

62 Improving Running Time of Approximate Pattern Search When searching with an edit distance threshold of k: partition y into k+1 pieces. At least one of them is an exact match. Why? (Dirichlet principle) [Figure: pattern split into pieces for k = 3]

63 Dirichlet (pigeonhole) Principle NK balls K+1 boxes Put balls in boxes At least one box contains < N balls

64 Improving Running Time of Approximate Pattern Search When searching with an edit distance threshold of k: partition y into k+1 pieces. At least one of them is an exact match. Why? (Dirichlet principle) [Figure: pattern split into pieces for k = 3]

65 Improving Running Time of Approximate Pattern Search Search each partition for an exact match. Align around the exact matches only. Is it a good idea? $(k+1)n / A^{m/(k+1)}$ random matches, where A is the alphabet size. [Figure: k = 3]

66 Neighborhood Searching Myers-94 (A4)

67 Neighborhood Searching (1) Find all subsequences of x within D edit distance to y. Assumption: m = log_A n. D-neighborhood of y = D-N(y) = set of all sequences within D edit distance to y.
1. Find D-N(y)
2. Find exact matches to all the sequences in D-N(y) in x
[Figure: text x of length n and pattern y of length m]

68 Neighborhood Searching (2) Condensed D-neighborhood of y = D-N'(y) = sequences in D-N(y) that do not have a proper prefix in D-N(y)

69 Searching Neighbors: Hash Table Fix the length of the sequence to search (say m). Create a hash table for all subsequences of x of length m. Look up the query sequence of length m.
Positions: 0 1 2 3 4 5 6 7 …
x =        C A C A C A T G G T A
AAAA -> #
…
ACAC -> 1
…
CACA -> 0, 2
…
TTTT -> #

70 Hash Table {A, C, G, T} -> {0, 1, 2, 3} -> {00, 01, 10, 11}
GTCAT, window GTC –101101 = 45 –(((2 x 4) + 3) x 4) + 1
GTCAT, window TCA –(((3 x 4) + 1) x 4) + 0 –52
O(n) space and construction time. What happens when the query is –shorter? –longer? What happens when –the alphabet size is large? –m is large?
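The encoding and the table can be sketched together. Each letter is a base-4 digit, and the code of the next window is obtained from the previous one in constant time. `CODE`, `encode`, `build_index` and `lookup` are illustrative names, and a dictionary stands in for the direct-address table.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(kmer):
    """Read the k-mer as a base-4 number, e.g. GTC -> ((2 * 4) + 3) * 4 + 1."""
    value = 0
    for ch in kmer:
        value = value * 4 + CODE[ch]
    return value


def build_index(x, m):
    """Map the code of every length-m window of x to its start positions."""
    index = defaultdict(list)
    code = 0
    for i, ch in enumerate(x):
        # Append the new letter and drop the letter that left the window.
        code = (code * 4 + CODE[ch]) % (4 ** m)
        if i >= m - 1:
            index[code].append(i - m + 1)
    return index


def lookup(index, query):
    """Positions where query occurs; query must have the indexed length m."""
    return index.get(encode(query), [])


if __name__ == "__main__":
    index = build_index("CACACATGGTA", 4)
    print(lookup(index, "CACA"), lookup(index, "ACAC"), lookup(index, "TTTT"))
```

A query shorter or longer than m cannot be looked up directly, which is the limitation the questions on the slide point at.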

71 Neighborhood Searching (3) O(Dn) worst case. O(Dn f(D/m) log n) expected time, where f(D/m) is an increasing concave function. What if m is large (i.e., m > log_A n)? –Dirichlet principle

72 Using NFA for Sequence Matching Baeza-Yates, Navarro – 99 (A6) x y m n

73 Using NFA for Sequence Matching (1) [Figure: NFA for pattern "patt", searched inside "waitt". Transition types: match, mismatch, delete pattern, insert pattern. 1: active state, 0: inactive state]

74 Using NFA for Sequence Matching (2) If A(i,j) = 1 then A(i+d,j+d) = 1 for d>0. Keep each diagonal's first active node. D_i = k if the first active node in diagonal i is k. Computation of D (next slide). A'(i,j) = (A(i,j-1) AND x(k) == y(j)) OR A(i-1,j) OR A(i-1,j-1) OR A'(i-1,j-1)

75 Using NFA for Sequence Matching (3)

76 Using NFA for Sequence Matching (4) NFA gets large for long y and large error threshold How can we manage long y ? –Dirichlet principle

77 Using NFA for Sequence Matching (5) Extension: Searching multiple patterns in parallel

78 Approximate Global Alignment of Sequences T. Kahveci, V. Ramaswamy, H. Tao, T. Li - 2005

79 The problem Given sequences X and Y –Bounded: Find global alignment of X and Y with at most k edit ops. –Unbounded: Find global alignment of X and Y with p% approximation p = 100 % = optimal alignment.

80 Frequency Vectors [KS'01] Frequency vector is the count of each letter. –f(s = AATGATAG) = [4, 0, 2, 2]. Edit operations & frequency vectors: –(del. G), s = AAT.ATAG => f(s) = [4, 0, 1, 2] –(ins. C), s = AACTATAG => f(s) = [4, 1, 1, 2] –(A → C), s = ACCTATAG => f(s) = [3, 2, 1, 2] Use frequency vectors to measure distance! [Figure: frequency space with axes n_A, n_C, n_G, n_T]

81 An Approximation to ED: Frequency Distance (FD) s = AATGATAG => f(s) = [4, 0, 2, 2]. q = ACTTAGC => f(q) = [2, 2, 1, 2]. –dec = (4-2) + (2-1) = 3 –inc = (2-0) = 2 –FD(f(s),f(q)) = 3 –ED(q,s) = 4. FD(f(s_1),f(s_2)) = max{inc,dec}. FD(f(s_1),f(s_2)) ≤ ED(s_1,s_2).
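The frequency distance from the example can be computed as follows. The alphabet order A, C, G, T matches the vectors on the slide; `freq_vector` and `frequency_distance` are illustrative names.

```python
def freq_vector(s, alphabet="ACGT"):
    """Count each alphabet letter in s, in alphabet order."""
    return [s.count(ch) for ch in alphabet]


def frequency_distance(u, v):
    """FD = max{inc, dec}: total increase and total decrease needed to turn u into v."""
    inc = sum(b - a for a, b in zip(u, v) if b > a)
    dec = sum(a - b for a, b in zip(u, v) if a > b)
    return max(inc, dec)


if __name__ == "__main__":
    fs = freq_vector("AATGATAG")
    fq = freq_vector("ACTTAGC")
    print(fs, fq, frequency_distance(fs, fq))
```

A single replace can lower one count and raise another at once, which is why the maximum of inc and dec, rather than their sum, is the lower bound on the edit distance.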

82 Distance Prediction using Frequency Vectors [Figure: alignments of ACTTAGC against AATGATAG illustrating ED and GED] Given frequency vectors of two strings x and y, GED(x,y) is normally distributed. Q = [12, 10, 3, 5], U = [11, 11, 4, 4], V = [6, 5, 9, 10]

83 Mean : Variance :

84 Bounded Alignment: lower bounding the alignment M_{i,j} = edit distance between prefixes of X and Y. d = lower bound to ED between suffixes of X and Y with at least p% probability. If (M_{i,j} + d > cutoff) then –No solution exists from (i,j) with p% probability. –Remove entry (i,j) [Figure: DP matrix of X against Y showing M_{i,j} for the prefixes and d for the suffixes]

85 Cost of computing lower bound? Frequency vectors can be computed in O(1) time incrementally.
Order of counts: [A C G T]
A A T T C => [2 1 0 2]
  A T T C => [1 1 0 2]

86 Unbounded Alignment: upper bounding the alignment D_{ij} = upper bound to the distance between suffixes. Use min_{i,j} {M_{i,j} + D_{ij}} as cutoff. Prune if (M_{i,j} + d > cutoff). D_{ij} is desirable if it is –computed quickly –tight [Figure: DP matrix of X against Y showing M_{i,j} for the prefixes and D_{ij} for the suffixes]

87 How to Compute the Upper Bound? D_{ij} = distance for a sample alignment of the suffixes, e.g. D_{i,j} = 4 [Figure: DP matrix of X against Y with a sample alignment of the two suffixes]

88 Cost of Computing Upper Bound? Upper bound can be computed in O(1) time incrementally. [Figure: two sample suffix alignments of AATCTG against CTCAG, one with D = 3 and the next, one position later, with D = 2]

89 Optimization 3: Path Prune No solution exists from entry (i, j) if its path to entry (0, 0) is blocked. Remove entry (i,j) [Figure: DP matrix with pruned entries marked X]

90 Optimization 3: Path Prune No solution exists from entry (i, j) if its path to entry (0, 0) is blocked. Remove entry (i,j) [Figure: DP matrix with more pruned entries marked X]

91 Unbounded Alignment: Time

92 Unbounded Alignment: Space

93 Bounded Alignment: Time

94 Bounded Alignment: Space

95 Recap & Goals Other important sequence comparison problems –banded alignment –end free search –pattern search –non-overlapping alignments –gaps –linear-space algorithms –inversions –bitwise operations –neighborhood searching –NFAs –Homology

96 What is Similarity Anyway? Similar: have similar letters. Homolog: have a common ancestor. Not exactly the same! Three types of homology –Paralog –Ortholog –Xenolog [Figure: Parent Organism giving rise to Organism A and Organism B]

97 Paralog & Ortholog (1) "Two genes are said to be paralogous if they are derived from a duplication event, but orthologous if they are derived from a speciation event." W-H Li
1. A gene called A in species w
2. is duplicated, producing initially two copies of A.
3. With time the two copies diverge by evolution, forming related genes A1 and A2.
These two genes are said to be paralogous to one another. Paralogy typically involves comparisons within a species.

98 Paralog & Ortholog (2) Two species, x and y evolve from species w, their common ancestor. The descendants of the A1 and A2 genes are now called A1x, A1y, and A2x, A2y to reflect which species they now occupy. A1x is orthologous to A1y and A2x is orthologous to A2y. The comparison is between two species.

99 Xenolog Xenology is defined as that condition (horizontal transfer) where the history of the gene involves an interspecies transfer of genetic material. It does not include transfer between organelles and the nucleus. It is the only form of homology in which the history has an episode where the descent is not from parent to offspring but, rather, from one organism to another.

100 Paralog, Ortholog, Xenolog Paralog Ortholog Xenolog

101 Recommended Reading Fitch, WM, "Homology: a personal view on some of the problems", Trends Genet., 2000, 16: 227-231

102 Overview Dot plots. Dynamic programming solutions –Local, global alignments and their extensions. Distance and similarity models. Gap models. Improvements and different approaches to the computation of sequence similarity. Similarity versus homology.

103 Next: Substitution Patterns Predict substitutions What are scoring matrices and how are they derived?
