Text Comparison of Genetic Sequences Shiri Azenkot Pomona College DIMACS REU 2004.


1 Text Comparison of Genetic Sequences Shiri Azenkot Pomona College DIMACS REU 2004

2 Comparing Two Strings Definition: A string is a sequence of consecutive characters. Examples: –“hello world” –“0123456” –DNA sequences –text files

3 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to turn X into Y. Allowed operations: –Insert a character –Delete a character –Replace a character Running time: O(mn) with a dynamic programming algorithm, where m and n are the lengths of X and Y
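The O(mn) dynamic program mentioned on this slide can be sketched as follows. This is a minimal illustration of the standard insert/delete/replace recurrence, not code from the presentation; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(x: str, y: str) -> int:
    """Classic edit distance with unit-cost insert, delete, and replace."""
    m, n = len(x), len(y)
    # prev[j] holds the distance between the first i-1 characters of x
    # and the first j characters of y.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete x[i-1]
                curr[j - 1] + 1,     # insert y[j-1]
                prev[j - 1] + cost,  # replace, or keep a matching character
            )
        prev = curr
    return prev[n]


distance = edit_distance("abcdef", "defabc")
```

Keeping only two rows reduces memory to O(n) while the running time stays O(mn). For the strings abcdef and defabc used on the following slides, the function returns 6.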

4 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = abcdef Y = defabc d(X, Y) = ? # operations = 0

5 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = bcdef Y = defabc d(X, Y) = ? # operations = 1

6 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = cdef Y = defabc d(X, Y) = ? # operations = 2

7 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = def Y = defabc d(X, Y) = ? # operations = 3

8 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = defa Y = defabc d(X, Y) = ? # operations = 4

9 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = defab Y = defabc d(X, Y) = ? # operations = 5

10 Comparing Two Strings If X and Y are strings, how similar are they? Edit distance, d(X, Y) – smallest number of operations needed to make X look like Y. X = defabc Y = defabc d(X, Y) = 6 # operations = 6 Does this seem too high?

11 Edit Distance with Moves d(X, Y): smallest number of operations to make X look like Y. –New operation: move a substring X = abcdef Y = defabc d(X, Y) = 1
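To make the new operation concrete, here is a small sketch of a single substring move. The helper `move_substring` and its parameters are hypothetical names chosen for this illustration; the slides do not define an interface.

```python
def move_substring(s: str, start: int, end: int, dest: int) -> str:
    """Cut s[start:end] and reinsert it at index dest of what remains."""
    block = s[start:end]
    rest = s[:start] + s[end:]
    return rest[:dest] + block + rest[dest:]


# One move of the block "abc" to the end turns abcdef into defabc.
moved = move_substring("abcdef", 0, 3, 3)
```

Counting this as one operation is what brings d(abcdef, defabc) down from 6 to 1.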

12 Edit Distance with Moves d(X, Y): smallest number of operations to make X look like Y. –New operation: move a substring Some applications –Computational biology – DNA sequences –Text editing –Webpage updating

13 Edit Distance with Moves The problem is NP-hard. The algorithm approximates d(X, Y) deterministically. Run time: O(n log n). Edit Sensitive Parsing (ESP) Algorithm: 1. Parse each string into a 2-3 tree. 2. Compare nodes (substrings) of the trees to compute an edit distance approximation.

14 Edit Distance with Moves Algorithm 1. Parse each string into a 2-3 tree. Every node represents a substring. X = bagcabagehead [Figure: 2-3 parse tree of X, with root bagcabagehead and internal nodes such as bagca.]

15 Edit Distance with Moves Algorithm 1. Parse each string into a 2-3 tree. Every node represents a substring. Y = cabageheadbag [Figure: 2-3 parse tree of Y.]

16 Edit Distance with Moves Algorithm 2. Compare nodes (substrings) of the trees to compute edit distance approximation. 2.1 Find frequencies of occurrence of each substring. X: bag, ca, ba, geh, ead (frequency 1 each); bagca, bagehead (frequency 1 each); root bagcabagehead

17 Edit Distance with Moves Algorithm 2. Compare nodes (substrings) of the trees to compute edit distance approximation. 2.1 Find frequencies of occurrence of each substring. Y: ca, ba, geh, ea, db, ag (frequency 1 each); caba, gehea, dbag (frequency 1 each)

18 Edit Distance with Moves Algorithm 2. Compare nodes (substrings) of the trees to compute edit distance approximation. 2.1 Find frequencies of occurrence of each substring. 2.2 Subtract characteristic vectors to get approximation for d(X, Y). X: bag, ca, ba, geh, ead, bagca, bagehead (frequency 1 each) minus Y: ca, ba, geh, ea, db, ag, caba, gehea, dbag (frequency 1 each) = 10
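Step 2.2 can be sketched as an L1 difference between two frequency vectors. The node lists below are transcribed from the parse trees on slides 16 to 18, leaving out the two roots and the single-character leaves (the leaves cancel because both strings contain the same characters). The parsing itself is not implemented here, and the helper name `l1_distance` is an assumption for this example.

```python
from collections import Counter


def l1_distance(nodes_x, nodes_y):
    """Sum of absolute differences between substring frequency vectors."""
    freq_x, freq_y = Counter(nodes_x), Counter(nodes_y)
    # A Counter returns 0 for substrings that appear in only one tree.
    return sum(abs(freq_x[s] - freq_y[s]) for s in freq_x.keys() | freq_y.keys())


# Internal nodes of the parse trees, as shown on the slides.
x_nodes = ["bag", "ca", "ba", "geh", "ead", "bagca", "bagehead"]
y_nodes = ["ca", "ba", "geh", "ea", "db", "ag", "caba", "gehea", "dbag"]

approximation = l1_distance(x_nodes, y_nodes)
```

Only ca, ba, and geh occur in both trees, so the four remaining X nodes and six remaining Y nodes each contribute 1, giving the value 10 shown on the slide.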

19 Edit Distance with Moves Algorithm 2. Compare nodes (substrings) of the trees to compute edit distance approximation. 2.1 Find frequencies of occurrence of each substring. 2.2 Subtract characteristic vectors to get approximation for d(X, Y). Actual edit distance with moves? d(bagcabagehead, cabageheadbag) = 1

20 Edit Distance with Moves Goals for this project: –Implement this algorithm –Test algorithm on DNA sequences Questions to think about: –How accurate is the approximation? –How applicable is this technique for comparing large biological sequences? –This algorithm finds repeating structures within the sequences when comparing them. Do these structures have significance? –Do such structures exist for real sequences?

21 Acknowledgements Mentor: Graham Cormode, DIMACS Postdoc. DIMACS REU 2004. References: –Benedetto, D., Caglioti, E., Loreto, V., “Language Trees and Zipping”, Physical Review Letters, 2002. –Cormode, G., Muthukrishnan, S., “The String Edit Distance Matching Problem with Moves”.

