Text Comparison of Genetic Sequences
Shiri Azenkot, Pomona College
DIMACS REU 2004


Comparing Two Strings
Definition: A string is a sequence of consecutive characters.
Examples:
–“hello world”
–“ ”
–DNA sequences
–text file

Comparing Two Strings
If X and Y are strings, how similar are they?
Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.
Allowed operations:
–Insert a character
–Delete a character
–Replace a character
Running time: O(mn) with a dynamic programming algorithm, where m and n are the lengths of X and Y.
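The O(mn) dynamic program mentioned above can be sketched as follows. This is a minimal illustration of the standard unit-cost recurrence, not code from the project; the function name `edit_distance` is made up for this example, and keeping only two rows of the table is a space-saving choice that does not change the running time.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance (insert, delete, replace) between x and y."""
    m, n = len(x), len(y)
    # prev[j] is the distance between the first i-1 characters of x and the
    # first j characters of y; for the empty prefix of x it is j insertions.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n  # column 0: delete the first i characters of x
        for j in range(1, n + 1):
            replace_cost = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                 # delete x[i-1]
                curr[j - 1] + 1,             # insert y[j-1]
                prev[j - 1] + replace_cost,  # replace, or keep a match
            )
        prev = curr
    return prev[n]


print(edit_distance("abcdef", "defabc"))  # 6
```

Each of the m·n table cells is filled in constant time, which is where the O(mn) bound comes from.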

Comparing Two Strings
If X and Y are strings, how similar are they?
Edit distance, d(X, Y): the smallest number of operations needed to make X look like Y.
Example, transforming X into Y one character at a time:
X = abcdef   Y = defabc   d(X, Y) = ?
X = bcdef    # operations = 1
X = cdef     # operations = 2
X = def      # operations = 3
X = defa     # operations = 4
X = defab    # operations = 5
X = defabc   # operations = 6
d(X, Y) = 6
Does this seem too high?

Edit Distance with Moves
d(X, Y): smallest number of operations to make X look like Y.
–New operation: move a substring
X = abcdef   Y = defabc
d(X, Y) = 1
Some applications:
–Computational biology: DNA sequences
–Text editing
–Webpage updating
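The move operation introduced above can be illustrated with a few lines of Python. This is only a sketch of a single move, not of the distance computation itself; the helper `move_substring` and its parameters are hypothetical names chosen for this example.

```python
def move_substring(s: str, start: int, end: int, dest: int) -> str:
    """Cut s[start:end] out of s and reinsert it at index dest of what remains."""
    if not 0 <= start <= end <= len(s):
        raise ValueError("invalid substring bounds")
    block = s[start:end]
    rest = s[:start] + s[end:]
    if not 0 <= dest <= len(rest):
        raise ValueError("invalid destination")
    return rest[:dest] + block + rest[dest:]


# One move turns X into Y: cut "abc" and reinsert it after "def".
print(move_substring("abcdef", 0, 3, 3))  # defabc
```

Counting this as one operation is what brings d(abcdef, defabc) down from 6 to 1.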

Edit Distance with Moves
The problem is NP-hard.
The algorithm approximates d(X, Y) deterministically.
Run time: O(n log n)
Edit Sensitive Parsing (ESP) Algorithm:
1. Parse each string into a 2-3 tree
2. Compare nodes (substrings) of the trees to compute an approximation of the edit distance

Edit Distance with Moves
Algorithm
1. Parse each string into a 2-3 tree
Every node represents a substring
X = bagcabagehead
[Figure: 2-3 parse tree of X]

Edit Distance with Moves
Algorithm
1. Parse each string into a 2-3 tree
Every node represents a substring
Y = cabageheadbag
[Figure: 2-3 parse tree of Y]
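The statement that every node represents a substring can be made concrete with a small tree structure. The sketch below does not implement the ESP parsing rules; it builds the tree for X by hand, and the grouping bag|ca and ba|geh|ead is an assumption read off the node labels that survive in the slide transcript. The names `Node` and `group` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Parse-tree node: a leaf holds one character, an internal node 2 or 3 children."""
    char: str = ""
    children: list = field(default_factory=list)

    def __post_init__(self):
        if self.children and len(self.children) not in (2, 3):
            raise ValueError("internal nodes need 2 or 3 children")

    def substring(self) -> str:
        """The substring this node represents: its leaves read left to right."""
        if not self.children:
            return self.char
        return "".join(child.substring() for child in self.children)


def group(text: str) -> Node:
    """Bottom-level internal node over 2 or 3 adjacent characters."""
    return Node(children=[Node(char=c) for c in text])


# Assumed grouping for X (hand-built, not produced by ESP): bag|ca and ba|geh|ead.
x_tree = Node(children=[
    Node(children=[group("bag"), group("ca")]),
    Node(children=[group("ba"), group("geh"), group("ead")]),
])

print(x_tree.substring())                        # bagcabagehead
print([c.substring() for c in x_tree.children])  # ['bagca', 'bagehead']
```

Because a node's substring is just the concatenation of its children's substrings, the root always spells out the whole input string.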

Edit Distance with Moves
Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
X: bagca = 1, bagehead = 1
[Figure: parse tree of X with substring frequencies]

Edit Distance with Moves
Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
Y: caba = 1, gehea = 1, dbag = 1
[Figure: parse tree of Y with substring frequencies; lower-level nodes: ca, ba, geh, ea, db, ag]

Edit Distance with Moves
Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
2.2 Subtract characteristic vectors to get approximation for d(X, Y)
X nodes: bagca, bagehead; bag, ca, ba, geh, ead
Y nodes: caba, gehea, dbag; ca, ba, geh, ea, db, ag
Approximation = 10

Edit Distance with Moves
Algorithm
2. Compare nodes (substrings) of the trees to compute edit distance approximation
2.1 Find frequencies of occurrence of each substring.
2.2 Subtract characteristic vectors to get approximation for d(X, Y)
Actual edit distance with moves?
d(bagcabagehead, cabageheadbag) = 1
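Steps 2.1 and 2.2 can be sketched as a frequency count followed by a sum of absolute differences. The node lists below are read off the labels in the flattened slide figures, so they are an assumption rather than the output of a real parser. Leaving out the two roots and the single characters is also an assumption, chosen because it reproduces the total of 10 shown on the slide (the single characters would contribute 0 anyway, since X and Y contain the same letters). The name `vector_distance` is hypothetical.

```python
from collections import Counter

# Node substrings per tree, with roots and single characters left out (assumption).
x_nodes = ["bagca", "bagehead", "bag", "ca", "ba", "geh", "ead"]
y_nodes = ["caba", "gehea", "dbag", "ca", "ba", "geh", "ea", "db", "ag"]


def vector_distance(nodes_a, nodes_b) -> int:
    """Sum of absolute differences between two substring-frequency vectors."""
    freq_a, freq_b = Counter(nodes_a), Counter(nodes_b)
    # A Counter returns 0 for substrings it has never seen.
    return sum(abs(freq_a[s] - freq_b[s]) for s in freq_a.keys() | freq_b.keys())


print(vector_distance(x_nodes, y_nodes))  # 10
```

Only ca, ba and geh occur in both trees; the four substrings unique to X and the six unique to Y account for the 10.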

Edit Distance with Moves
Goals for this project:
–Implement this algorithm
–Test algorithm on DNA sequences
Questions to think about:
–How accurate is the approximation?
–How applicable is this technique to comparing large biological sequences?
–This algorithm finds repeating structures within the sequences when comparing them. Do these structures have significance?
–Do such structures exist for real sequences?

Acknowledgements
Mentor: Graham Cormode, DIMACS Postdoc
DIMACS REU 2004
References:
–Benedetto, D., Caglioti, E., Loreto, V., “Language Trees and Zipping”. Physical Review Letters, 2002.
–Cormode, G., Muthukrishnan, S., “The String Edit Distance Matching Problem with Moves”.