"Quadratic time algorithms for finding common intervals in two and more sequences" by T. Schmidt and J. Stoye, Proc. 15th Annual Symposium on Combinatorial.

1 "Quadratic time algorithms for finding common intervals in two and more sequences" by T. Schmidt and J. Stoye, Proc. 15th Annual Symposium on Combinatorial Pattern Matching, Lecture Notes in Computer Science 3109, pp. 347-358 (2004). Presented by Gangman Yi

2 Overview
  Introduction
  Formal Model
  Algorithms
  Assignment

3 Gene Order & Function in Bacteria:
  Observations:
    Gene order in bacterial genomes is weakly conserved.
    Some genes tend to cluster together even in unrelated species.
    Functional association of genes inside a cluster?

4 Formalization of Gene Clusters:
  Genomes: permutations π1, π2, …, πk
  Genes: numbers 1, …, n

7 Formalization of Gene Clusters:
  Genomes: permutations π1, π2, …, πk
  Genes: numbers 1, …, n
  Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations)
  Example:
    π1: 1 2 3 4 5 6 7 8
    π2: 8 7 6 4 5 2 1 3
    π3: 3 1 2 5 8 7 6 4
    π4: 6 7 4 2 1 3 8 5
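The definition of a common interval can be checked directly on the four example permutations. The following is a minimal brute-force sketch, not one of the linear-time algorithms cited in the talk; the function names `is_contiguous` and `common_intervals` are illustrative assumptions.

```python
def is_contiguous(genes, perm):
    """True if the members of `genes` occupy consecutive positions in `perm`."""
    positions = [idx for idx, g in enumerate(perm) if g in genes]
    return positions[-1] - positions[0] + 1 == len(genes)


def common_intervals(perms):
    """Gene sets of size >= 2 that are contiguous in every permutation (brute force)."""
    first, rest = perms[0], perms[1:]
    found = []
    for i in range(len(first)):
        for j in range(i + 1, len(first)):
            genes = set(first[i:j + 1])
            if all(is_contiguous(genes, p) for p in rest):
                found.append(tuple(sorted(genes)))
    return found


# The four permutations from the slide.
perms = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [8, 7, 6, 4, 5, 2, 1, 3],
    [3, 1, 2, 5, 8, 7, 6, 4],
    [6, 7, 4, 2, 1, 3, 8, 5],
]
for interval in common_intervals(perms):
    print(interval)
```

Every candidate is an interval of the first permutation, so it is enough to test contiguity in the remaining ones. The sketch is cubic or worse in n and is meant only to make the definition concrete.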

10 Formalization of Gene Clusters:
  Genomes: permutations π1, π2, …, πk
  Genes: numbers 1, …, n
  Gene cluster: common interval (a subset of numbers occurring contiguously in all permutations)
  Algorithms:
    Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n + |output|) time.
    Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.

11 Modeling multiple copies of a gene (paralogs):
  Problem: Gene duplication results in multiple copies of a gene inside a genome.
  It is difficult to assign the correct gene pair.
  [Figure: genomes π1, π2, π3 over positions 1 to 8; candidate copies of a duplicated gene (here gene 7) are marked with "?"]

15 Modeling multiple copies of a gene (paralogs):
  Solution: Do not distinguish between paralogous gene copies. Each paralogous copy of a gene gets the same number.
  Consequence: Genomes are modeled as sequences instead of permutations.
  [Figure: sequences S1, S2, S3; values as extracted: 1 2 3 4 5 6 7 8, then 3 1 2 4 8 7 6 1 2 8 7 6 7 5 4 2 1 3]

16 Formal Model:
  Given: String S over a finite alphabet Σ
  Notation:
    S[i] = the i-th character of S
    S[i,j] = substring of S starting at index i and ending at index j
  Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].
  Example: CS(S[2,5]) = {1,2,3}
    index: 1 2 3 4 5 6 7 8
    S:     3 1 2 3 1 5 2 6
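As a small illustration of the definition, the character set can be computed with a Python set. This is a sketch only; the helper name `char_set` is an assumption, and the indices are 1-based and inclusive to match the slide's notation.

```python
def char_set(s, i, j):
    """CS(S[i,j]): the set of characters in S[i..j], 1-based and inclusive."""
    return set(s[i - 1:j])


# The example string from the slide.
S = [3, 1, 2, 3, 1, 5, 2, 6]
print(char_set(S, 2, 5))  # character set of the substring 1 2 3 1
```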

17 Formal Model:
  Given: Subset C ⊆ Σ
  Definition: (i, j) is a CS-location of C in S iff CS(S[i,j]) = C.
    left-maximal: S[i-1] ∉ C
    right-maximal: S[j+1] ∉ C
    maximal: both left- and right-maximal
  Example: The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal, since S[2] = 1 ∈ C.
    index: 1 2 3 4 5 6 7 8
    S:     3 1 2 3 1 5 2 6
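The location and maximality conditions translate into a few one-line checks. The sketch below uses 1-based inclusive indices and assumes that a location touching the start or end of S counts as maximal on that side; the slide does not state this boundary convention, and all function names are illustrative.

```python
def is_cs_location(s, c, i, j):
    """True if CS(s[i,j]) == c, with 1-based inclusive indices."""
    return set(s[i - 1:j]) == set(c)


def is_left_maximal(s, c, i):
    # Assumption: a location starting at position 1 counts as left-maximal.
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, c, j):
    # Assumption: a location ending at the last position counts as right-maximal.
    return j == len(s) or s[j] not in c


def maximal_cs_locations(s, c):
    """All maximal CS-locations of c in s, found as maximal runs of characters from c."""
    c = set(c)
    result, start = [], None
    for idx in range(1, len(s) + 2):  # one step past the end closes the last run
        inside = idx <= len(s) and s[idx - 1] in c
        if inside and start is None:
            start = idx
        elif not inside and start is not None:
            if set(s[start - 1:idx - 1]) == c:
                result.append((start, idx - 1))
            start = None
    return result


S = [3, 1, 2, 3, 1, 5, 2, 6]
C = {1, 2, 3}
print(is_cs_location(S, C, 3, 5), is_left_maximal(S, C, 3))
print(maximal_cs_locations(S, C))
```

A maximal CS-location is necessarily a maximal run of characters from C, which is why a single left-to-right scan is enough here.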

18 Formal Model:
  Given: Collection of k strings S* = (S1, ..., Sk) over alphabet Σ
  Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each Sl, 1 ≤ l ≤ k.
  Example: common CS-factor {1,3,5} with CS-locations S1: (3,7), S2: (2,6), S3: (2,5) (positions counted from 1)
    S1: 3 2 1 3 1 5 1 6
    S2: 4 3 5 5 5 1 4 2 2
    S3: 7 5 1 5 3 6 5 1 2
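The example can be verified by enumerating CS-locations by brute force. This is a sketch for checking the definition only; `cs_locations` and `is_common_cs_factor` are hypothetical helper names, and positions are 1-based.

```python
def cs_locations(s, c):
    """All 1-based pairs (i, j) with CS(s[i,j]) == c, by brute force."""
    c = set(c)
    n = len(s)
    return [(i, j)
            for i in range(1, n + 1)
            for j in range(i, n + 1)
            if set(s[i - 1:j]) == c]


def is_common_cs_factor(strings, c):
    """True if c has at least one CS-location in every string."""
    return all(cs_locations(s, c) for s in strings)


# The three strings from the slide.
S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5, 1, 2]
print(is_common_cs_factor([S1, S2, S3], {1, 3, 5}))
```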

19 Problem Formulation:
  A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.
  Given a collection of k strings S*:
    Problem 1: Find all common CS-factors in S*.
    Problem 2: For each common CS-factor, find all its maximal CS-locations in each of the strings.

20 Algorithm "Connecting Intervals" (CI)
  Algorithm CI solves Problem 1 and Problem 2 for two sequences.
  Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n
  Output: Pairs of CS-locations of all common CS-factors
  Time & space complexity: O(n²)

21 Preprocessing
  Compute two tables for S1 = (3,1,2,3,1,5,2,6).
  POS[c] holds all positions where character c occurs in S1:
    POS[1] = 2,5
    POS[2] = 3,7
    POS[3] = 1,4
    POS[4] = empty
    POS[5] = 6
    POS[6] = 8
  NUM(i,j) counts the number of unique characters in S1[i,j]:
    i\j  1 2 3 4 5 6 7 8
    1    1 2 3 3 3 4 4 5
    2      1 2 3 3 4 4 5
    3        1 2 3 4 4 5
    4          1 2 3 4 5
    5            1 2 3 4
    6              1 2 3
    7                1 2
    8                  1
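Both tables can be built in quadratic time. The sketch below assumes 1-based positions, with row and column 0 of NUM left unused; `build_pos` and `build_num` are illustrative names rather than the authors' code.

```python
def build_pos(s):
    """POS[c]: sorted list of 1-based positions at which character c occurs in s."""
    pos = {}
    for idx, c in enumerate(s, start=1):
        pos.setdefault(c, []).append(idx)
    return pos


def build_num(s):
    """NUM[i][j]: number of distinct characters in s[i..j], for 1 <= i <= j <= len(s)."""
    n = len(s)
    num = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])        # extending the substring by one character
            num[i][j] = len(seen)
    return num


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
POS = build_pos(S1)
NUM = build_num(S1)
print(POS)
print(NUM[1][1:])  # first row of the table
```

Each row of NUM is filled by extending one substring character by character, so the whole table costs O(n²) expected time and O(n²) space.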

22 Algorithm CI
  Algorithm: While reading S2, mark in S1 the observed character and track maximal intervals of marked characters.
  Pointers i and j delimit the substring of S2 read so far. POS and NUM are the tables computed for S1 during preprocessing.
    index: 1 2 3 4 5 6 7 8 9
    S1:    3 1 2 3 1 5 2 6
    S2:    4 3 5 5 5 1 4 2 2

24 Algorithm CI
  Output (CS-location in S2, then the matching CS-location in S1): ((2,2)-(1,1)) ((2,2)-(4,4))

28 Algorithm CI
  Output: ((2,2)-(1,1)) ((2,2)-(4,4)) ((2,6)-(4,6))
  The third pair corresponds to S2[2,6] = 3 5 5 5 1 and S1[4,6] = 3 1 5, which both have the character set {1,3,5}.

30 Time Complexity
  Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.
    for i = 1,...,|S2| do
      j = i
      while j ≤ |S2| and (i,j) is left-maximal do
        if c = S2[j] is seen for the first time then
          for each entry in POS[c] do
            mark and track
          end for
        end if
        j = j + 1
      end while
    end for
  For a fixed i, every position of S1 is marked at most once, so each iteration of the outer loop costs O(n).
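The loop above can be turned into runnable code once "mark and track" is made concrete. The following is a sketch under stated assumptions, not the authors' implementation: runs of marked positions in S1 are tracked with two boundary arrays (`left`, `right`), a character of S2 that never occurs in S1 ends the inner loop, and j is advanced past already-seen characters so that only right-maximal locations in S2 are reported. All names are illustrative.

```python
def algorithm_ci(s1, s2):
    """Pairs ((i, j), (a, b)), 1-based: a maximal CS-location (i, j) in s2 and a
    maximal CS-location (a, b) in s1 that share the same character set."""
    n1, n2 = len(s1), len(s2)

    # Preprocessing on s1: POS and NUM.
    pos = {}
    for p, c in enumerate(s1, start=1):
        pos.setdefault(c, []).append(p)
    num = [[0] * (n1 + 1) for _ in range(n1 + 1)]
    for a in range(1, n1 + 1):
        distinct = set()
        for b in range(a, n1 + 1):
            distinct.add(s1[b - 1])
            num[a][b] = len(distinct)

    out = []
    for i in range(1, n2 + 1):
        marked = [False] * (n1 + 2)  # sentinels at 0 and n1 + 1 stay unmarked
        left = [0] * (n1 + 2)        # left[b]: start of the marked run ending at b
        right = [0] * (n1 + 2)       # right[a]: end of the marked run starting at a
        seen = set()
        j = i
        while j <= n2:
            c = s2[j - 1]
            if i > 1 and c == s2[i - 2]:
                break                # (i, j) is no longer left-maximal
            if c not in pos:
                break                # c never occurs in s1: no larger set can match
            seen.add(c)
            starts = []              # left ends of the runs that now contain c
            for p in pos[c]:
                a = left[p - 1] if marked[p - 1] else p
                b = right[p + 1] if marked[p + 1] else p
                marked[p] = True
                left[b], right[a] = a, b
                if not starts or starts[-1] != a:
                    starts.append(a)
            # Advance j while the next character is already seen (right-maximality).
            while j < n2 and s2[j] in seen:
                j += 1
            for a in starts:
                b = right[a]
                if num[a][b] == len(seen):  # the run uses every seen character
                    out.append(((i, j), (a, b)))
            j += 1
    return out


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
for pair in algorithm_ci(S1, S2):
    print(pair)
```

On the example strings, the first reported pairs are ((2, 2), (1, 1)) and ((2, 2), (4, 4)), matching the output shown on the slides. Only runs containing the newly added character are checked, because any run whose character set equals the current set must contain it; this keeps the work per value of i linear in the length of S1.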

31 Multiple Genomes
  Goal: Find all common CS-factors of a collection S* = (S1, S2, ..., Sk).
  Algorithm: Apply Algorithm CI to all pairs (S1, Sl), 2 ≤ l ≤ k. Output only the common CS-factors detected in all pairs.
  Time complexity: O(kn²)
  Space complexity: O(kn²) with redundant output, O(n²) otherwise
  Further extension: Find all common CS-factors appearing in at least k' of the k strings of S*.
    Time complexity: O(k(1+k-k')n²)
  Saving space: Due to the storage of the table NUM, Algorithm CI requires quadratic space.
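The intended result for several strings, including the "at least k' of k" extension, can be illustrated with a brute-force stand-in. The sketch below does not run Algorithm CI on pairs; it enumerates the character sets of all substrings and counts in how many strings each set occurs, which yields the same sets at a higher cost. The names `substring_char_sets` and `cs_factors_with_support` are assumptions, and the three strings are reused from the earlier common CS-factor example.

```python
from collections import Counter


def substring_char_sets(s):
    """Character sets of all substrings of s, as frozensets (brute force)."""
    sets = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            sets.add(frozenset(seen))
    return sets


def cs_factors_with_support(strings, k_min):
    """Character sets that have a CS-location in at least k_min of the strings."""
    support = Counter()
    for s in strings:
        support.update(substring_char_sets(s))  # each string counts once per set
    return {c for c, count in support.items() if count >= k_min}


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5, 1, 2]
common = cs_factors_with_support([S1, S2, S3], k_min=3)
print(frozenset({1, 3, 5}) in common)
```

With `k_min` equal to the number of strings, the result is the set of common CS-factors of the whole collection; smaller values give the relaxed variant.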

32 Assignment
  Design a clustering algorithm.
  Each sequence S contains n distinct genes; the same gene may also occur in the other sequences.
  The number of sequences is k.
  The maximum cluster size is m, so each output cluster can contain at most m genes.
  Ignore the order of the genes within a cluster.
  [Figure: sequences S1, S2, S3, ..., Sk of length n; extracted labels: ABDC, BCDA, ADCB, BCAD; "Max. size for the cluster, m = 4"; "Output Example"; EF, FE, EF, FE]

33 Gangman Yi Email : gangman@cs.tamu.edu THANK YOU

