Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences Thomas Schmidt Jens Stoye CPM 2004, Istanbul.



Thomas Schmidt: Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences

Overview:
- Introduction
- Formal Model
- Algorithms
- Results

Gene Order and Function in Bacteria:
Observations:
- Gene order in bacterial genomes is weakly conserved
- Some genes tend to cluster together even in unrelated species
- Functional association of genes inside a cluster

Gene Order and Function in Bacteria:
Are there more clusters?

Gene Order and Function in Bacteria:
Task: Establish a model and search for gene clusters

Formalization of Gene Clusters:
- Genomes: permutations π_1, π_2, …, π_k
- Genes: numbers 1, …, n
- Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations)
[Figure: example permutations π_1, π_2, π_3, π_4]

Formalization of Gene Clusters:
Algorithms:
- Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n + |output|) time.
- Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.
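To make the definition of a common interval concrete, here is a minimal brute-force sketch in Python. It is not the Uno & Yagiura or Heber & Stoye algorithm and does not reach their running times; the function name `common_intervals` and the example permutations are assumptions made for illustration only.

```python
def common_intervals(perms):
    """Brute force: all gene sets of size >= 2 that are contiguous in every permutation.

    Illustrative only; this is NOT one of the output-sensitive algorithms cited on the slide.
    """
    first = perms[0]
    n = len(first)
    # Position of each gene in each of the remaining permutations.
    positions = [{g: idx for idx, g in enumerate(p)} for p in perms[1:]]
    result = []
    for i in range(n):
        for j in range(i + 1, n):
            genes = first[i:j + 1]
            # A set of j - i + 1 distinct genes is contiguous in p exactly when
            # the spread of its positions in p equals j - i.
            if all(max(pos[g] for g in genes) - min(pos[g] for g in genes) == j - i
                   for pos in positions):
                result.append(frozenset(genes))
    return result


# Hypothetical example permutations (not taken from the slides).
example = [[1, 2, 3, 4, 5], [3, 2, 1, 5, 4]]
print(sorted(sorted(c) for c in common_intervals(example)))
```

For the hypothetical input above, the sets {1,2}, {2,3}, {4,5}, {1,2,3} and {1,2,3,4,5} are contiguous in both permutations.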

Modeling multiple copies of a gene (paralogs):
Problem:
- Gene duplication results in multiple copies of a gene inside a genome
- Difficult to assign the correct gene pair
[Figure: permutations π_1, π_2, π_3 with ambiguous gene assignments marked "?"]

Modeling multiple copies of a gene (paralogs):
Solution:
- Do not distinguish between paralogous gene copies
- Each paralogous copy of a gene gets the same number
Consequence:
- Genomes are modeled as sequences instead of permutations
[Figure: sequences S_1, S_2, S_3]

Overview:
- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

Formal Model:
Given: String S over a finite alphabet Σ
Notation:
- S[i] = the i-th character of S
- S[i,j] = substring of S starting at index i and ending at j
Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].
Example: CS(S[2,5]) = {1,2,3}
[Figure: the example sequence S]
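The character set definition translates almost directly into code. The sketch below uses 1-based inclusive indices as on the slide; the helper name `char_set` is an assumption, and the sequence used is S_1 = (3,1,2,3,1,5,2,6) from the preprocessing slide, which happens to be consistent with the example above but is not confirmed to be the sequence shown on this slide.

```python
def char_set(s, i, j):
    """CS(S[i,j]): set of characters in S[i..j], with 1-based inclusive indices."""
    return set(s[i - 1:j])


# S_1 from the preprocessing slide, reused here as an assumed example.
S1 = (3, 1, 2, 3, 1, 5, 2, 6)
print(char_set(S1, 2, 5))  # the characters 1, 2 and 3
```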

Formal Model:
Given: Subset C ⊆ Σ
Definition: (i,j) is a CS-location of C in S iff CS(S[i,j]) = C
- left-maximal: S[i-1] ∉ C
- right-maximal: S[j+1] ∉ C
- maximal: both left- and right-maximal
Example: The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal!
[Figure: the example sequence S]
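A small Python sketch of these definitions follows. The function names are hypothetical, the sequence is S_1 from the preprocessing slide (an assumed stand-in for the slide's S), and treating a location that touches the start or end of the string as maximal on that side is an assumption, since the slide does not state the boundary case.

```python
def is_cs_location(s, i, j, c):
    """(i, j) is a CS-location of C in S iff CS(S[i,j]) == C (1-based, inclusive)."""
    return set(s[i - 1:j]) == set(c)


def is_left_maximal(s, i, c):
    # Assumption: a location starting at position 1 counts as left-maximal.
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, j, c):
    # Assumption: a location ending at the last position counts as right-maximal.
    return j == len(s) or s[j] not in c


def maximal_cs_locations(s, c):
    """Brute-force enumeration of all maximal CS-locations of C in S."""
    c = set(c)
    n = len(s)
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(i, n + 1)
            if is_cs_location(s, i, j, c)
            and is_left_maximal(s, i, c)
            and is_right_maximal(s, j, c)]


S1 = (3, 1, 2, 3, 1, 5, 2, 6)   # assumed example sequence
C = {1, 2, 3}
print(is_cs_location(S1, 3, 5, C), is_left_maximal(S1, 3, C))
print(maximal_cs_locations(S1, C))
```

On this sequence, (3,5) is a CS-location of {1,2,3} but not left-maximal, and the only maximal CS-location of that set is (1,5).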

Formal Model:
Given: Collection of k strings S* = (S_1, ..., S_k) over alphabet Σ
Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each S_l, 1 ≤ l ≤ k.
Example: common CS-factor {1,3,5} with CS-locations S_1: (3,7), S_2: (2,6), S_3: (2,5)
[Figure: the example sequences S_1, S_2, S_3]

Problem Formulation:
A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.
Given a collection of k strings S*:
- Problem 1: Find all common CS-factors in S*.
- Problem 2: For each common CS-factor, find all its maximal CS-locations in each of the strings.
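As a reference point for Problem 1, the following is a brute-force Python sketch that enumerates the character sets of all substrings and intersects them across strings. It is only meant to pin down the expected output, not to match the algorithms of the talk; the function names and the example strings are assumptions.

```python
def cs_factors(s):
    """Character sets of all substrings of s, as frozensets."""
    found = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            found.add(frozenset(seen))
    return found


def common_cs_factors(strings):
    """Sets that have a CS-location in every string of the collection (brute force)."""
    common = cs_factors(strings[0])
    for s in strings[1:]:
        common &= cs_factors(s)
    return common


# Hypothetical example strings (not from the slides).
A = (1, 2, 1, 3)
B = (3, 2, 1)
print(sorted(sorted(c) for c in common_cs_factors([A, B])))
```

Here {1,3} occurs only in A and {2,3} only in B, so neither is reported as common.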

Overview:
- Introduction
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

Algorithm "Connecting Intervals" (CI)
Algorithm CI solves Problem 1 and Problem 2 for two sequences.
- Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n
- Output: Pairs of CS-locations of all common CS-factors
- Time & space complexity: O(n²)

Preprocessing
Compute two tables for S_1 = (3,1,2,3,1,5,2,6):
- POS[c] holds all positions where character c occurs in S_1.
- NUM(i,j) counts the number of different characters in S_1[i,j].

POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8

[Table: NUM(i,j), indexed by i and j]
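The two tables can be computed with straightforward loops. The Python sketch below uses 1-based positions as on the slide; the function names `build_pos` and `build_num` and the list-of-lists layout for NUM are assumptions, not details given in the talk.

```python
def build_pos(s, m):
    """POS[c]: 1-based positions of character c in s, for c in 1..m."""
    pos = {c: [] for c in range(1, m + 1)}
    for idx, c in enumerate(s, start=1):
        pos[c].append(idx)
    return pos


def build_num(s):
    """NUM[i][j]: number of different characters in s[i..j] (1-based, i <= j).

    Uses O(n^2) time and space; entries with i > j are left at 0.
    """
    n = len(s)
    num = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            num[i][j] = len(seen)
    return num


S1 = (3, 1, 2, 3, 1, 5, 2, 6)
POS = build_pos(S1, 6)
NUM = build_num(S1)
print(POS)
print(NUM[2][5], NUM[1][8])
```

For S_1 this reproduces the POS entries listed above, including the empty list for character 4.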

Algorithm CI
Algorithm: While reading S_2, mark in S_1 the observed character and track maximal intervals of marked characters.
[Animation: pointers i and j move over S_2 while positions in S_1 are marked, using the POS and NUM tables of S_1 from the preprocessing step]
Output reported during the animation: ((2,2)-(1,1)) ((2,2)-(4,4)) ((1,5)-(4,6))
Final frame: (i,j) not left-maximal!

Time Complexity

for i = 1,...,|S_2| do
    j = i
    while j < |S_2| and (i,j) is maximal do
        if c = S_2[j] is seen for the first time
            for each entry in POS(c) do
                mark and track
            end for
        end if
        j = j + 1
    end while
end for

Algorithm CI finds all common CS-factors of S_1 and S_2 in O(n²) time.

POS[1] = 1,4
POS[2] = 2,6
POS[3] = 0,3
POS[4] = empty
POS[5] = 5
POS[6] = 7
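The mark-and-track idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it rescans S_1 for runs of marked positions after every extension and recomputes the character count of each run instead of looking it up in NUM, so it does not achieve the O(n²) bound. The function name, the handling of string boundaries, and the sequence S_2 used in the example are all hypothetical.

```python
def connecting_intervals_sketch(s1, s2):
    """Report pairs ((i, j), (a, b)) with CS(S2[i,j]) == CS(S1[a,b]), both maximal.

    Simplified and slower than Algorithm CI; positions are 1-based.
    """
    n1, n2 = len(s1), len(s2)
    pos = {}
    for idx, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(idx)

    output = []
    for i in range(1, n2 + 1):
        marked = [False] * (n1 + 2)
        chars = set()
        for j in range(i, n2 + 1):
            c = s2[j - 1]
            if c not in chars:              # character seen for the first time
                chars.add(c)
                for p in pos.get(c, []):    # mark its occurrences in S1
                    marked[p] = True
            if i > 1 and s2[i - 2] in chars:
                break                       # (i, j) is not left-maximal, nor any longer one
            if j < n2 and s2[j] in chars:
                continue                    # (i, j) is not right-maximal
            # Track maximal runs of marked positions in S1.
            p = 1
            while p <= n1:
                if not marked[p]:
                    p += 1
                    continue
                a = p
                while p <= n1 and marked[p]:
                    p += 1
                b = p - 1
                # Plays the role of the NUM(a, b) lookup in Algorithm CI.
                if len(set(s1[a - 1:b])) == len(chars):
                    output.append(((i, j), (a, b)))
    return output


S1 = (3, 1, 2, 3, 1, 5, 2, 6)
S2 = (3, 1, 5)   # hypothetical; the S_2 of the slides is not preserved
for pair in connecting_intervals_sketch(S1, S2):
    print(pair)
```

A run of marked positions is bounded by unmarked characters, so it is maximal by construction; it is a CS-location of the current set exactly when it contains as many different characters as have been read from S_2.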

Multiple Genomes
Goal: Find all common CS-factors of a collection S* = (S_1, S_2, ..., S_k)
Algorithm:
1. Apply Algorithm CI to all pairs (S_1, S_l), 2 ≤ l ≤ k
2. Output only the common CS-factors detected in all pairs
Time complexity: O(kn²)
Space complexity: O(kn²) with redundant output, O(n²) otherwise
Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*
Time complexity: O(k(1+k-k')n²)
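The "at least k' of k" variant can be illustrated with a brute-force Python sketch that counts, for every character set, in how many strings it has a CS-location. This is not the pairwise CI scheme described above and does not reach its time bound; the function names and the example strings are assumptions.

```python
from collections import Counter


def cs_factors(s):
    """Character sets of all substrings of s, as frozensets."""
    found = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            found.add(frozenset(seen))
    return found


def cs_factors_in_at_least(strings, k_min):
    """Sets that have a CS-location in at least k_min of the given strings."""
    counts = Counter()
    for s in strings:
        # cs_factors returns a set, so each string counts a factor at most once.
        counts.update(cs_factors(s))
    return {c for c, cnt in counts.items() if cnt >= k_min}


# Hypothetical example strings (not from the slides).
strings = [(1, 2, 3), (3, 2, 1), (1, 3)]
print(sorted(sorted(c) for c in cs_factors_in_at_least(strings, 3)))
print(sorted(sorted(c) for c in cs_factors_in_at_least(strings, 2)))
```

Setting `k_min` to the number of strings gives the ordinary common CS-factors; lowering it admits clusters that are missing from some genomes.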

Saving Space
- Because it stores the table NUM, Algorithm CI requires quadratic space.
- An algorithm presented by Didier, WABI 2003, detects all common CS-factors of two sequences in O(n² log n) time and linear space.
- In a modified version, which replaces a binary search with a constant-time Range Maximum Query, the time complexity can be reduced to O(n²) while the space remains linear.

Overview:
- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

Results on real data
Data set:
- 43 bacterial genome sequences from NCBI
- All classified in the "Clusters of Orthologous Groups of Proteins" database (COG)
- Genes are identified by their COG number
- Computation time: approx. [value missing] minutes on a standard PC

Results on real data (k' = 2)
[Figure: results for all 43 genomes and for the data set without closely related genomes (k = 32), each shown for cluster size ≥ 2 and cluster size ≥ 3]

Thank you!