"Quadratic time algorithms for finding common intervals in two and more sequences" by T. Schmidt and J. Stoye, Proc. 15th Annual Symposium on Combinatorial.

Presentation transcript:

"Quadratic time algorithms for finding common intervals in two and more sequences" by T. Schmidt and J. Stoye, Proc. 15th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 3109, pp (2004). Presented by Gangman Yi

Overview: Introduction, Formal Model, Algorithms, Assignment

Gene Order & Function in Bacteria: Observations: Gene order in bacterial genomes is weakly conserved. Some genes tend to cluster together even in unrelated species. Open question: is there a functional association between the genes inside a cluster?

Formalization of Gene Clusters: Genomes: permutations π1, π2, …, πk. Genes: numbers 1, …, n. Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations). Algorithms: Uno & Yagiura, Algorithmica 2000: find all common intervals of two permutations in O(n + |output|) time. Heber & Stoye, CPM 2001: find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.
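To make the definition of a common interval concrete, here is a minimal brute-force sketch. It is not the Uno & Yagiura or Heber & Stoye algorithm (those run in output-sensitive linear time); it simply tests every interval of the first permutation, which takes cubic time. The function name `common_intervals` and the example permutations are assumptions chosen for illustration.

```python
def common_intervals(perms):
    """Return gene sets (size >= 2) that occur contiguously in every permutation.

    Brute force for illustration only: checks each interval of the first
    permutation against all other permutations.
    """
    first = perms[0]
    n = len(first)
    # Position of each gene in every other permutation.
    positions = [{g: idx for idx, g in enumerate(p)} for p in perms[1:]]
    result = []
    for i in range(n):
        for j in range(i + 1, n):
            genes = first[i:j + 1]
            # The genes are contiguous in a permutation exactly when the
            # positions they occupy span len(genes) consecutive slots.
            if all(
                max(pos[g] for g in genes) - min(pos[g] for g in genes) == j - i
                for pos in positions
            ):
                result.append(set(genes))
    return result


# Hypothetical example data (not taken from the slides).
example_perms = [[1, 2, 3, 4, 5], [3, 1, 2, 5, 4]]
intervals = common_intervals(example_perms)
```

For these two permutations the sketch reports {1,2}, {1,2,3}, {1,2,3,4,5} and {4,5}.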

Modeling multiple copies of a gene (paralogs): Problem: Gene duplication results in multiple copies of a gene inside a genome. It is then difficult to assign the correct gene pair.

Modeling multiple copies of a gene (paralogs): Solution: Do not distinguish between paralogous gene copies. Each paralogous copy of a gene gets the same number. Consequence: Genomes are modeled as sequences instead of permutations.

Formal Model: Given: String S over a finite alphabet Σ. Notation: S[i] = the i-th character of S; S[i,j] = substring of S starting at index i and ending at j. Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j]. Example: CS(S[2,5]) = {1,2,3} (the example string S was shown as a figure on the slide).
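The character set definition translates almost directly into code. The sketch below uses 1-based, inclusive indices to match the slide notation. The helper name `char_set` is an assumption, and the string used is the S1 from the preprocessing slide, borrowed here because it happens to reproduce the example value; the slide's own example string did not survive extraction.

```python
def char_set(s, i, j):
    """CS(S[i,j]): the set of characters in S[i..j], 1-based and inclusive."""
    return set(s[i - 1:j])


# Assumed example string (taken from the preprocessing slide).
S = (3, 1, 2, 3, 1, 5, 2, 6)
cs_2_5 = char_set(S, 2, 5)  # characters of the substring (1, 2, 3, 1)
```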

Formal Model: Given: Subset C ⊆ Σ. Definition: (i, j) is a CS-location of C in S iff CS(S[i,j]) = C. A CS-location is left-maximal if S[i-1] ∉ C, right-maximal if S[j+1] ∉ C, and maximal if it is both left- and right-maximal. Example: The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal.
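The CS-location and maximality conditions can be checked mechanically. The following is a small sketch with 1-based indices; the function names are hypothetical, and treating the string boundaries (i = 1, j = |S|) as automatically maximal is an assumption the slide does not spell out. The string is again the S1 from the preprocessing slide, which is consistent with the (3,5) example.

```python
def is_cs_location(s, i, j, c):
    """True if CS(S[i,j]) == C (1-based, inclusive indices)."""
    return set(s[i - 1:j]) == set(c)


def is_left_maximal(s, i, j, c):
    # Assumption: a location starting at the first character is left-maximal.
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, i, j, c):
    # Assumption: a location ending at the last character is right-maximal.
    return j == len(s) or s[j] not in c


def is_maximal(s, i, j, c):
    return is_left_maximal(s, i, j, c) and is_right_maximal(s, i, j, c)


S = (3, 1, 2, 3, 1, 5, 2, 6)  # assumed example string
C = {1, 2, 3}
# (3,5) is a CS-location of C, but S[2] = 1 is in C, so it is not left-maximal.
# Extending it to (1,5) gives a maximal CS-location.
```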

Formal Model: Given: Collection of k strings S* = (S1, ..., Sk) over alphabet Σ. Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each Sl, 1 ≤ l ≤ k. Example: {1,3,5} is a common CS-factor with CS-locations S1: (3,7), S2: (2,6), S3: (2,5).
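A direct, unoptimized way to test the common CS-factor definition is to enumerate the CS-locations of C in each string. The sketch below does that by brute force; `cs_locations` and `is_common_cs_factor` are hypothetical names, and the strings are illustrative assumptions because the strings from the slide figure were lost.

```python
def cs_locations(s, c):
    """All CS-locations (i, j) of C in S, 1-based and inclusive (brute force)."""
    c = set(c)
    locations = []
    for i in range(1, len(s) + 1):
        seen = set()
        for j in range(i, len(s) + 1):
            ch = s[j - 1]
            if ch not in c:
                break  # any longer substring would contain a foreign character
            seen.add(ch)
            if seen == c:
                locations.append((i, j))
    return locations


def is_common_cs_factor(strings, c):
    """C is a common CS-factor iff it has a CS-location in every string."""
    return all(cs_locations(s, c) for s in strings)


# Assumed example strings (not the ones from the slide figure).
A = (3, 1, 2, 3, 1, 5, 2, 6)
B = (5, 3, 1, 4)
```

With these strings, {1,3,5} has the CS-location (4,6) in A and (1,3) in B, so it is a common CS-factor, while {1,2} is not because B contains no 2.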

Problem Formulation: A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes. Given a collection of k strings S*: Problem 1: Find all common CS-factors in S*. Problem 2: For each common CS-factor find all its maximal CS-locations in each of the strings.

Algorithm "Connecting Intervals" (CI): Algorithm CI solves Problem 1 and Problem 2 for two sequences. Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n. Output: Pairs of CS-locations of all common CS-factors. Time and space complexity: O(n²).

Preprocessing: Compute two tables for S1 = (3,1,2,3,1,5,2,6). POS[c] holds all positions where character c occurs in S1. NUM(i,j) counts the number of unique characters in S1[i,j]. POS[1] = 2,5; POS[2] = 3,7; POS[3] = 1,4; POS[4] = empty; POS[5] = 6; POS[6] = 8. (The NUM(i,j) table was shown as a figure on the slide.)
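The two preprocessing tables can be built as follows. This is a minimal sketch: the function names `build_pos` and `build_num` and the 1-based table layout are assumptions, and a Python set is used for counting distinct characters, which gives O(n²) expected time and O(n²) space for NUM.

```python
def build_pos(s, m):
    """POS[c]: 1-based positions at which character c occurs in s, for c = 1..m."""
    pos = {c: [] for c in range(1, m + 1)}
    for idx, c in enumerate(s, start=1):
        pos[c].append(idx)
    return pos


def build_num(s):
    """NUM[i][j]: number of distinct characters in s[i..j] (1-based, i <= j)."""
    n = len(s)
    num = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            num[i][j] = len(seen)
    return num


S1 = (3, 1, 2, 3, 1, 5, 2, 6)
POS = build_pos(S1, 6)
NUM = build_num(S1)
```

For S1 this reproduces the POS entries listed on the slide, for example POS[1] = [2, 5] and POS[4] = [].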

Algorithm CI: While reading S2, mark in S1 the observed character and track maximal intervals of marked characters. (The slide animation steps through S2 with indices i and j and shows the marks in S1 together with the POS and NUM tables; the strings themselves were shown as figures.) Output reported during the example: ((2,2)-(1,1)), ((2,2)-(4,4)), ((1,5)-(4,6)).

Time Complexity: Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.

for i = 1,...,|S2| do
    j = i
    while j < |S2| and (i,j) is maximal do
        if (c = S2[j]) is seen the first time
            for each entry in POS[c] do
                mark and track
            end for
        end if
        j = j + 1
    end while
end for
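The mark-and-track idea can be sketched in runnable form as shown below. This is a simplified illustration, not the authors' implementation: it rescans S1 for runs of marked positions at every step and recounts distinct characters instead of using the NUM table, so it runs in cubic rather than quadratic time. The function name, the output format (a dictionary from character set to pairs of 1-based locations, S2 location first), and the example string `T2` are assumptions.

```python
def common_cs_factors(s1, s2):
    """Map each common character set to pairs ((i, j) in s2, (a, b) in s1).

    Only maximal CS-locations are reported; all indices are 1-based.
    """
    n1, n2 = len(s1), len(s2)
    pos = {}
    for idx, c in enumerate(s1):
        pos.setdefault(c, []).append(idx)

    found = {}
    for i in range(n2):
        marked = [False] * n1
        seen = set()
        for j in range(i, n2):
            c = s2[j]
            if i > 0 and c == s2[i - 1]:
                break  # (i, j) and every extension is no longer left-maximal
            if c not in seen:  # character seen for the first time
                seen.add(c)
                for p in pos.get(c, []):
                    marked[p] = True
            if j + 1 < n2 and s2[j + 1] in seen:
                continue  # not right-maximal in s2 yet
            # Scan s1 for maximal runs of marked positions.
            a = 0
            while a < n1:
                if not marked[a]:
                    a += 1
                    continue
                b = a
                while b + 1 < n1 and marked[b + 1]:
                    b += 1
                # The run is a CS-location only if it holds every seen character
                # (the role played by the NUM table in Algorithm CI).
                if len(set(s1[a:b + 1])) == len(seen):
                    pair = ((i + 1, j + 1), (a + 1, b + 1))
                    found.setdefault(frozenset(seen), set()).add(pair)
                a = b + 1
    return found


T1 = (3, 1, 2, 3, 1, 5, 2, 6)  # S1 from the preprocessing slide
T2 = (5, 3, 1, 4)              # hypothetical second sequence
factors = common_cs_factors(T1, T2)
```

With these inputs the sketch finds the common CS-factors {1}, {3}, {5}, {1,3} and {1,3,5}; for instance {1,3,5} is reported with the pair ((1,3), (4,6)).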

Multiple Genomes: Goal: Find all common CS-factors of a collection S* = (S1, S2, ..., Sk). Algorithm: Apply Algorithm CI to all pairs (S1, Sl), 2 ≤ l ≤ k, and output only the common CS-factors detected in all pairs. Time complexity: O(kn²). Space complexity: O(kn²) with redundant output, O(n²) otherwise. Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*. Time complexity: O(k(1+k-k')n²). Saving space: Due to the storage of the table NUM, Algorithm CI requires quadratic space.
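The "keep only what is detected in every string" step amounts to a set intersection. The sketch below illustrates that idea with a naive enumeration of substring character sets in place of Algorithm CI, so it does not achieve the O(kn²) bound stated above. The function names and the example strings are assumptions.

```python
def substring_char_sets(s):
    """All character sets CS(S[i,j]) over the substrings of s (naive enumeration)."""
    sets = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            sets.add(frozenset(seen))
    return sets


def common_cs_factors_multi(strings):
    """Character sets that have a CS-location in every string of the collection."""
    first, *rest = strings
    common = substring_char_sets(first)
    for s in rest:
        # Keep only the sets that also occur in this string.
        common &= substring_char_sets(s)
    return common


# Hypothetical collection of k = 3 sequences.
collection = [(3, 1, 2, 3, 1, 5, 2, 6), (5, 3, 1, 4), (1, 3, 5)]
shared = common_cs_factors_multi(collection)
```

Here {3,5} occurs in the second and third strings but not in the first, so it is dropped; the surviving sets are {1}, {3}, {5}, {1,3} and {1,3,5}.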

Assignment: Design a clustering algorithm. Each sequence S has n unique genes, but the same gene can appear in the other sequences. The number of sequences is k. The maximum output size for a cluster is m, so each cluster can have at most m genes. Ignore the order of the genes within each cluster. Example figure (layout lost in extraction): sequences S1, S2, S3, ..., Sk of length n; max. size for the cluster, m = 4; output example: ABDC, BCDA, ADCB, BCAD and EF, FE, EF, FE.

Gangman Yi THANK YOU