Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences Thomas Schmidt Jens Stoye CPM 2004, Istanbul.

1 Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences Thomas Schmidt Jens Stoye CPM 2004, Istanbul

2 Thomas Schmidt: Quadratic Time Algorithms for Finding Common Intervals in Two and More Sequences
Overview:
- Introduction
- Formal Model
- Algorithms
- Results

3 Gene Order and Function in Bacteria:
Observations:
- Gene order in bacterial genomes is weakly conserved
- Some genes tend to cluster together even in unrelated species
- Functional association of genes inside a cluster



9 Gene Order and Function in Bacteria:
Are there more clusters?

11 Gene Order and Function in Bacteria:
Task: Establish a model and search for gene clusters

12 Formalization of Gene Clusters:
Genomes: permutations π1, π2, …, πk
Genes: numbers 1,…,n
π1: 1 2 3 4 5 6 7 8
π2: 8 7 6 4 5 2 1 3
π3: 3 1 2 5 8 7 6 4
π4: 6 7 4 2 1 3 8 5
Gene cluster: common interval (subset of numbers occurring contiguously in all permutations)

18 Formalization of Gene Clusters:
Genomes: permutations π1, π2, …, πk
Genes: numbers 1,…,n
Gene cluster: common interval (subset of numbers occurring contiguously in all permutations)
Algorithms:
- Uno & Yagiura, Algorithmica 2000: Find all common intervals of two permutations in O(n + |output|) time.
- Heber & Stoye, CPM 2001: Find all common intervals of k ≥ 2 permutations in O(kn + |output|) time.
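The common-interval definition for permutations can be checked by brute force on small inputs. The sketch below is only an illustration of the definition, not the linear-time algorithms cited on the slide; the function name `common_intervals` is an assumption made for this example. It uses the fact that a set of m genes is contiguous in a permutation exactly when its positions span m slots.

```python
def common_intervals(perms):
    """Return gene sets (size >= 2) that are contiguous in every permutation.

    Brute force: every interval of the first permutation is tested
    against all other permutations.
    """
    first, rest = perms[0], perms[1:]
    n = len(first)
    # Position of each gene in each of the remaining permutations.
    positions = [{gene: idx for idx, gene in enumerate(p)} for p in rest]
    found = []
    for start in range(n):
        for end in range(start + 1, n):
            genes = first[start:end + 1]
            # Contiguous iff the positions span exactly len(genes) slots.
            if all(
                max(pos[g] for g in genes) - min(pos[g] for g in genes) == end - start
                for pos in positions
            ):
                found.append(frozenset(genes))
    return found


perms = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [8, 7, 6, 4, 5, 2, 1, 3],
    [3, 1, 2, 5, 8, 7, 6, 4],
    [6, 7, 4, 2, 1, 3, 8, 5],
]
clusters = common_intervals(perms)
for cluster in clusters:
    print(sorted(cluster))
```

The full gene set is always reported, since it is trivially contiguous in every permutation.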

19 Modeling multiple copies of a gene (paralogs):
Problem:
- Gene duplication results in multiple copies of a gene inside a genome
- Difficult to assign the correct gene pair
[Figure: genomes π1 = 1 2 3 4 5 6 7 8, π2 and π3, with uncertain gene assignments marked "?"]

23 Modeling multiple copies of a gene (paralogs):
Solution:
- Do not distinguish between paralogous gene copies
- Each paralogous copy of a gene gets the same number
Consequence:
- Genomes are modeled as sequences instead of permutations
S1: 1 2 3 4 5 6 7 8
S2 and S3 (concatenated): 3 1 2 4 8 7 6 1 2 8 7 6 7 5 4 2 1 3

24 Overview:
- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

25 Formal Model:
Given: String S over a finite alphabet Σ
Notation:
- S[i] = the i-th character of S
- S[i,j] = substring of S starting at index i and ending at j
Definition: The character set CS(S[i,j]) := {S[k] | i ≤ k ≤ j} is the set of all characters occurring in the substring S[i,j].
Example:
index: 1 2 3 4 5 6 7 8
S:     3 1 2 3 1 5 2 6
CS(S[2,5]) = {1,2,3}
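The character set definition translates almost directly into code. A minimal sketch, assuming the string is stored as a Python list and indices are 1-based and inclusive as on the slide; the name `character_set` is chosen for this example only.

```python
def character_set(s, i, j):
    """CS(S[i,j]) for 1-based, inclusive indices i <= j."""
    return set(s[i - 1:j])


S = [3, 1, 2, 3, 1, 5, 2, 6]
print(character_set(S, 2, 5))
```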

26 Formal Model:
Given: Subset C ⊆ Σ
Definition: (i,j) is a CS-location of C in S iff CS(S[i,j]) = C. A CS-location (i,j) is
- left-maximal if S[i-1] ∉ C
- right-maximal if S[j+1] ∉ C
- maximal if it is both left- and right-maximal
Example:
index: 1 2 3 4 5 6 7 8
S:     3 1 2 3 1 5 2 6
The pair (3,5) is a CS-location of the set C = {1,2,3}, because CS(S[3,5]) = {1,2,3}, but it is not left-maximal!
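The maximality conditions can be sketched as small predicates. The function names are hypothetical, and one assumption goes beyond the slide: a location touching the start or end of S is treated as maximal on that side, because there is no neighbouring character to test. `maximal_cs_locations` relies on the observation that maximal CS-locations of C are exactly the maximal runs of characters from C whose character set equals C.

```python
def is_cs_location(s, i, j, c):
    """True if CS(S[i,j]) == C (1-based, inclusive indices)."""
    return set(s[i - 1:j]) == set(c)


def is_left_maximal(s, i, c):
    # Assumption: i == 1 counts as left-maximal (no left neighbour exists).
    return i == 1 or s[i - 2] not in c


def is_right_maximal(s, j, c):
    # Assumption: j == len(s) counts as right-maximal (no right neighbour exists).
    return j == len(s) or s[j] not in c


def maximal_cs_locations(s, c):
    """All maximal CS-locations of C in S, found by scanning runs of C-characters."""
    c = set(c)
    locations, start = [], None
    for pos in range(1, len(s) + 2):  # one extra step closes a trailing run
        inside = pos <= len(s) and s[pos - 1] in c
        if inside and start is None:
            start = pos
        elif not inside and start is not None:
            if set(s[start - 1:pos - 1]) == c:
                locations.append((start, pos - 1))
            start = None
    return locations


S = [3, 1, 2, 3, 1, 5, 2, 6]
C = {1, 2, 3}
print(is_cs_location(S, 3, 5, C), is_left_maximal(S, 3, C))
print(maximal_cs_locations(S, C))
```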

27 Formal Model:
Given: Collection of k strings S* = (S1,...,Sk) over alphabet Σ
Definition: C ⊆ Σ is a common CS-factor of S* if and only if C has a CS-location in each Sl, 1 ≤ l ≤ k.
Example:
index: 1 2 3 4 5 6 7 8 9
S1:    3 2 1 3 1 5 1 6
S2:    4 3 5 5 5 1 4 2 2
S3:    7 5 1 5 3 6 5
Common CS-factor: {1,3,5} with CS-locations S1: (3,7), S2: (2,6), S3: (2,5)
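For small inputs, the definition of a common CS-factor can be checked by enumerating the character sets of all substrings of each string and intersecting them. This is a brute-force sketch of the definition, not the algorithm presented in the talk; `character_sets` and `common_cs_factors` are names invented for this example.

```python
def character_sets(s):
    """All character sets CS(S[i,j]) of substrings of s, as frozensets."""
    sets = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            sets.add(frozenset(seen))
    return sets


def common_cs_factors(strings):
    """Character sets that have a CS-location in every string."""
    common = character_sets(strings[0])
    for s in strings[1:]:
        common &= character_sets(s)
    return common


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
factors = common_cs_factors([S1, S2, S3])
print(frozenset({1, 3, 5}) in factors)
```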

28 Problem Formulation:
A common CS-factor of k strings represents a gene cluster that occurs in each of the k genomes.
Given a collection of k strings S*:
- Problem 1: Find all common CS-factors in S*.
- Problem 2: For each common CS-factor find all its maximal CS-locations in each of the strings.

29 Overview:
- Introduction
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

30 Algorithm "Connecting Intervals" (CI)
Algorithm CI solves Problem 1 and Problem 2 for two sequences.
Input: Two sequences of length up to n with characters drawn from Σ = {1,...,m}, m ≤ 2n
Output: Pairs of CS-locations of all common CS-factors
Time & space complexity: O(n²)

31 Preprocessing
Compute two tables for S1 = (3,1,2,3,1,5,2,6):
- POS[c] holds all positions where character c occurs in S1.
- NUM(i,j) counts the number of different characters in S1[i,j].

POS[1] = 2,5
POS[2] = 3,7
POS[3] = 1,4
POS[4] = empty
POS[5] = 6
POS[6] = 8

NUM(i,j), rows i, columns j:
i\j  1  2  3  4  5  6  7  8
1    1  2  3  3  3  4  4  5
2       1  2  3  3  4  4  5
3          1  2  3  4  4  5
4             1  2  3  4  5
5                1  2  3  4
6                   1  2  3
7                      1  2
8                         1
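Both preprocessing tables can be built in a few lines. The sketch below assumes 1-based positions as on the slide and stores NUM as a full (n+1) × (n+1) list of lists, with row and column 0 unused; the helper names `build_pos` and `build_num` are illustrative choices.

```python
from collections import defaultdict


def build_pos(s):
    """POS[c]: sorted list of 1-based positions where character c occurs in s."""
    pos = defaultdict(list)
    for index, char in enumerate(s, start=1):
        pos[char].append(index)
    return pos


def build_num(s):
    """NUM[i][j]: number of different characters in s[i..j], for 1 <= i <= j <= n."""
    n = len(s)
    num = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        seen = set()
        for j in range(i, n + 1):
            seen.add(s[j - 1])
            num[i][j] = len(seen)
    return num


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
POS = build_pos(S1)
NUM = build_num(S1)
print(POS[1], NUM[1][8])
```

Each row of NUM is filled with one left-to-right scan, so the table costs O(n²) time and space, which matches the quadratic space bound stated for Algorithm CI.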

32 Algorithm CI
Algorithm: While reading S2, mark in S1 the observed character and track maximal intervals of marked characters (using the tables POS and NUM from the preprocessing step).
index: 1 2 3 4 5 6 7 8 9
S2:    4 3 5 5 5 1 4 2 2
S1:    3 1 2 3 1 5 2 6


34 Algorithm CI
Output: ((2,2)-(1,1)) ((2,2)-(4,4))


38 Algorithm CI
Output: ((2,2)-(1,1)) ((2,2)-(4,4)) ((2,6)-(4,6))

39 Algorithm CI
(i,j) is not left-maximal!

40 Algorithm CI (pseudocode)
for i = 1,...,|S2| do
    j = i
    while j ≤ |S2| and (i,j) is left-maximal do
        if c = S2[j] is seen for the first time then
            for each entry in POS[c] do
                mark and track
            end for
        end if
        j = j + 1
    end while
end for
Time Complexity: Algorithm CI finds all common CS-factors of S1 and S2 in O(n²) time.
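The pseudocode leaves "mark and track" abstract. The following is one possible reading, written as a sketch under stated assumptions rather than the authors' implementation: marked runs in S1 are tracked with two boundary arrays (`left`, `right`, names invented here), a run is reported when NUM of the run equals the number of distinct characters read so far, and a pair is emitted only when (i,j) is also right-maximal in S2. Positions are 1-based, and each result has the form ((i,j) in S2, (l,r) in S1).

```python
from collections import defaultdict


def connecting_intervals(s1, s2):
    """Pairs of maximal CS-locations ((i, j) in s2, (l, r) in s1), 1-based."""
    n1, n2 = len(s1), len(s2)
    pos = defaultdict(list)
    for p, c in enumerate(s1, start=1):
        pos[c].append(p)
    # num[l][r] = number of distinct characters in s1[l..r]
    num = [[0] * (n1 + 1) for _ in range(n1 + 1)]
    for l in range(1, n1 + 1):
        seen = set()
        for r in range(l, n1 + 1):
            seen.add(s1[r - 1])
            num[l][r] = len(seen)

    output = []
    for i in range(1, n2 + 1):
        marked = [False] * (n1 + 2)  # sentinels at 0 and n1 + 1
        left = [0] * (n1 + 2)        # left[r]: start of the marked run ending at r
        right = [0] * (n1 + 2)       # right[l]: end of the marked run starting at l
        seen, matches = set(), []
        j = i
        # Stop once S2[i-1] would enter the set: (i, j) is no longer left-maximal.
        while j <= n2 and (i == 1 or s2[j - 1] != s2[i - 2]):
            c = s2[j - 1]
            if c not in seen:
                seen.add(c)
                starts = set()
                for p in pos.get(c, []):
                    marked[p] = True
                    l = left[p - 1] if marked[p - 1] else p
                    r = right[p + 1] if marked[p + 1] else p
                    left[r], right[l] = l, r
                    starts.add(l)
                # A marked run matches if it contains every character seen so far.
                matches = [(l, right[l]) for l in sorted(starts)
                           if not marked[l - 1] and num[l][right[l]] == len(seen)]
            # Report only when (i, j) is right-maximal in s2.
            if j == n2 or s2[j] not in seen:
                output.extend(((i, j), m) for m in matches)
            j += 1
    return output


S1 = [3, 1, 2, 3, 1, 5, 2, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
pairs = connecting_intervals(S1, S2)
for pair in pairs:
    print(pair)
```

Because marked runs are bounded by unmarked positions, every reported S1 interval is maximal by construction; the NUM lookup replaces an explicit comparison of character sets.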

41 Multiple Genomes
Goal: Find all common CS-factors of a collection S* = (S1, S2, ..., Sk)
Algorithm:
1. Apply Algorithm CI to all pairs (S1, Sl), 2 ≤ l ≤ k
2. Output only the common CS-factors detected in all pairs
Time complexity: O(kn²)
Space complexity: O(kn²) with redundant output, O(n²) otherwise
Further extension: Find all common CS-factors appearing in at least k' of k strings of S*
Time complexity: O(k(1+k-k')n²)
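The "at least k' of k" extension can be illustrated by counting, for each character set, the number of strings in which it has a CS-location. The sketch below is a brute-force stand-in for the pairwise use of Algorithm CI and does not achieve the stated time bound; `cs_factors_in_at_least` and its parameter `k_min` (playing the role of k') are names assumed for this example.

```python
from collections import Counter


def character_sets(s):
    """All character sets CS(S[i,j]) of substrings of s, as frozensets."""
    sets = set()
    for i in range(len(s)):
        seen = set()
        for j in range(i, len(s)):
            seen.add(s[j])
            sets.add(frozenset(seen))
    return sets


def cs_factors_in_at_least(strings, k_min):
    """Character sets that have a CS-location in at least k_min of the strings."""
    counts = Counter()
    for s in strings:
        counts.update(character_sets(s))  # each set is counted once per string
    return {c for c, hits in counts.items() if hits >= k_min}


S1 = [3, 2, 1, 3, 1, 5, 1, 6]
S2 = [4, 3, 5, 5, 5, 1, 4, 2, 2]
S3 = [7, 5, 1, 5, 3, 6, 5]
in_all = cs_factors_in_at_least([S1, S2, S3], 3)
in_two = cs_factors_in_at_least([S1, S2, S3], 2)
print(len(in_all), len(in_two))
```

Setting `k_min` to the number of strings recovers the common CS-factors of the whole collection.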

42 Saving Space
Due to the storage of the table NUM, Algorithm CI requires quadratic space.
An algorithm presented by Didier (WABI 2003) detects all common CS-factors of two sequences in O(n² log n) time and linear space.
In a modified version that replaces a binary search with a constant-time Range Maximum Query, the time complexity can be reduced to O(n²) while the space remains linear.
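The slide does not say which Range Maximum Query structure is used. As a generic illustration of constant-time queries after preprocessing, here is a sparse-table sketch; the class name `SparseTableMax` is an assumption. Note that a sparse table needs O(n log n) space, so a linear-space RMQ structure would be required to obtain the bound stated above.

```python
class SparseTableMax:
    """Range maximum queries in O(1) after O(n log n) preprocessing."""

    def __init__(self, values):
        # table[k][p] = max of values[p .. p + 2**k - 1]
        self.table = [list(values)]
        width = 1
        while 2 * width <= len(values):
            prev = self.table[-1]
            self.table.append([
                max(prev[p], prev[p + width])
                for p in range(len(values) - 2 * width + 1)
            ])
            width *= 2

    def query(self, lo, hi):
        """Maximum of values[lo..hi], 0-based and inclusive, with lo <= hi."""
        level = (hi - lo + 1).bit_length() - 1
        row = self.table[level]
        # Two overlapping windows of length 2**level cover [lo, hi].
        return max(row[lo], row[hi - (1 << level) + 1])


rmq = SparseTableMax([3, 1, 2, 3, 1, 5, 2, 6])
print(rmq.query(0, 4), rmq.query(1, 2))
```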

43 Overview:
- Introduction
  - Comparative genomics
  - Common Intervals and Gene Clusters
- Formal Model
- Algorithms
  - Simple Data Structure: Quadratic Space
  - Saving Space
- Results

44 Results on real data
Data set:
- 43 bacterial genome sequences from NCBI
- All classified in the "Clusters of Orthologous Groups of Proteins" database (COG)
- Genes are identified by their COG number
- Computation time: approx. 5-10 minutes on a standard PC

45 Results on real data (k' = 2)
[Figure panels: all 43 genomes, cluster size ≥ 2 and cluster size ≥ 3; without closely related genomes (k = 32), cluster size ≥ 2 and cluster size ≥ 3]

46 Thank you (Teşekkür ederim)!

