Using PQ Trees For Comparative Genomics - CPM 2005. Gad M. Landau (Univ. of Haifa), Laxmi Parida (IBM T.J. Watson).

1 Using PQ Trees For Comparative Genomics - CPM 2005
Gad M. Landau (Univ. of Haifa), Laxmi Parida (IBM T.J. Watson), Oren Weimann (Univ. of Haifa)

2 Gene Clusters
Genes that appear together consistently across genomes are believed to be functionally related; however, their order does not have to be the same in every genome.
[Figure: the same gene cluster in Genome 1 through Genome 5]

3 What is a π Pattern? [WABI04]
Given a string S = s1 s2 s3 … sn and an integer K, a pattern P = {p1, p2, p3, …, pm} is a π pattern if P occurs (possibly permuted) in at least K places in S.
Example: S = a b c d b a c d a b a c b, P = {a,b,c}, K = 4. P is a 4-π pattern with location list = {1,5,10,11}.
For the moment we will assume that every character appears once in the pattern.
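The definition on this slide can be checked mechanically with a sliding window. The following is a minimal sketch, not the authors' algorithm; the function name `pi_pattern_locations` is an assumption made for illustration, and locations are 1-based to match the slide.

```python
from collections import Counter

def pi_pattern_locations(s, pattern):
    """Return the 1-based start positions where `pattern` occurs in `s`
    as a contiguous block, in any order (a permuted occurrence)."""
    m = len(pattern)
    if m == 0 or m > len(s):
        return []
    target = Counter(pattern)
    window = Counter(s[:m])          # multiset of the current window
    locations = []
    for start in range(len(s) - m + 1):
        if start > 0:
            # Slide the window one step: drop the left character, add the right one.
            left = s[start - 1]
            window[left] -= 1
            if window[left] == 0:
                del window[left]
            window[s[start + m - 1]] += 1
        if window == target:
            locations.append(start + 1)
    return locations

S = "abcdbacdabacb"
locations = pi_pattern_locations(S, "abc")
is_4_pi_pattern = len(locations) >= 4
```

For the slide's example, `locations` is `[1, 5, 10, 11]`, so {a,b,c} qualifies for K = 4.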

4 Maximal π Patterns
Maximal notation: a representation of a maximal π pattern p that illustrates all the π patterns that are non-maximal with respect to p.
Example: S = a b c d e b a d c e. {a,b} is non-maximal with respect to {a,b,c,d,e}. The maximal notation of {a,b,c,d,e} is ((a,b)-(c,d)-e).
Our goal: find all π patterns p and their maximal notation.
Our solution: a linear time algorithm based on PQ trees.

5 PQ trees: Definitions
PQ trees [Booth, Lueker, 1976]
Leaves are labeled with characters.
P-nodes: represent "truly permuted" components; the children may be permuted arbitrarily.
Q-nodes: represent bi-connected components; the only rearrangement allowed is reversal of the children.
[Figure: example PQ tree with leaves A through K]

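6 PQ trees: Definitions
Equivalent PQ trees (≡): two trees are equivalent if one can be obtained from the other by permuting the children of P-nodes and reversing the children of Q-nodes.
[Figure: two equivalent PQ trees over the leaves A through K]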

7 PQ trees: Definitions
FRONTIER(T): the leaf labels of T read from left to right. For the example tree, FRONTIER(T) = "A B C D E F G H I J K"; an equivalent tree T' has FRONTIER(T') = "A B C G H I J K E F D".
C(T) = the set of frontiers of all trees equivalent to T.
Theorem: If C(T1) = C(T2) then T1 ≡ T2.
[Figure: PQ tree T with leaves A through K]

8 Our Use of the PQ tree
Suppose the π pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
Our goal: a tree T with C(T) = { abcd, acbd, dbca, dcba }.
Write the P-nodes as ',' and the Q-nodes as '-' and get (a-(b,c)-d), which is exactly the maximal notation of the π pattern {a,b,c,d}.
[Figure: PQ tree with a Q-node root over a, a P-node with children b and c, and d]
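To make C(T) and the ',' / '-' notation concrete, here is a minimal sketch that enumerates every frontier of a small PQ tree by brute force. The `Node` class and the helper names `frontiers` and `notation` are assumptions made for this illustration; the enumeration is exponential and is only meant for toy trees with single-character leaves.

```python
from dataclasses import dataclass
from itertools import permutations, product
from typing import List, Union

@dataclass
class Node:
    kind: str                              # "P" or "Q"
    children: List[Union["Node", str]]     # leaves are plain strings

def frontiers(t):
    """Return C(T): the frontiers of all trees equivalent to t."""
    if isinstance(t, str):
        return {t}
    if t.kind == "P":
        orders = permutations(t.children)          # any order of the children
    else:
        orders = (t.children, t.children[::-1])    # left-to-right or reversed
    out = set()
    for order in orders:
        for parts in product(*(frontiers(c) for c in order)):
            out.add("".join(parts))
    return out

def notation(t):
    """Write P-nodes with ',' and Q-nodes with '-'."""
    if isinstance(t, str):
        return t
    sep = "," if t.kind == "P" else "-"
    return "(" + sep.join(notation(c) for c in t.children) + ")"

T = Node("Q", ["a", Node("P", ["b", "c"]), "d"])
```

Here `notation(T)` gives `(a-(b,c)-d)` and `sorted(frontiers(T))` gives `['abcd', 'acbd', 'dbca', 'dcba']`, the four occurrences listed on the slide.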

9 The minimal Consensus PQ tree
It is not always possible to find a tree T where Π = C(T). Consider a π pattern {a,b,c,d} that appears as Π = { abcd, bdac }: here { abcd, bdac } ⊂ C(T).
Given permutations Π = {π1, π2, …, πk}, the consensus PQ tree T of Π is such that Π ⊆ C(T), and the consensus is minimal when there exists no other T' such that Π ⊆ C(T') and |C(T')| < |C(T)|.
The problem of obtaining a maximal notation for a π pattern is the same as obtaining a minimal consensus PQ tree of all the k occurrences.
Theorem: The minimal consensus PQ tree T is unique.
[Figure: PQ tree over the leaves b, c, a, d]

10 The original use of the PQ Tree
The consecutive 1's problem. The restriction sets: F = { {a,b,c}, {b,c}, {b,c,d}, {b} }, written as a matrix with one column per set:

    a  1 0 0 0
    b  1 1 1 1
    c  1 1 1 0
    d  0 0 1 0

The solution [Booth, Lueker, 1976]: Reduce(F) = T. The result will be C(T), in our case C(T) = { abcd, acbd, dbca, dcba }, and the tree is constructed in O(n²) time (for an n x n matrix).
[Figure: PQ tree (a-(b,c)-d)]
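The meaning of Reduce(F) can be illustrated by brute force: keep exactly those orderings of the rows in which every restriction set is consecutive. This sketch is not the Booth and Lueker algorithm (which builds the tree directly rather than enumerating permutations); the names `is_consecutive` and `orders_satisfying` are assumptions made for illustration.

```python
from itertools import permutations

def is_consecutive(order, subset):
    """True if the members of `subset` occupy adjacent positions in `order`."""
    positions = sorted(order.index(x) for x in subset)
    return positions[-1] - positions[0] == len(positions) - 1

def orders_satisfying(universe, restrictions):
    """All orderings of `universe` in which every restriction set is consecutive."""
    return sorted(
        "".join(p)
        for p in permutations(universe)
        if all(is_consecutive(p, r) for r in restrictions)
    )

F = [{"a", "b", "c"}, {"b", "c"}, {"b", "c", "d"}, {"b"}]
valid_orders = orders_satisfying("abcd", F)
```

`valid_orders` is `['abcd', 'acbd', 'dbca', 'dcba']`, which matches the C(T) stated on the slide.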

11 Obtaining the Minimal Consensus PQ tree
Some definitions [Heber, Stoye, 2001], illustrated on Π = {π1, π2, π3}:

    π1 = 1 2 3 4 5 6 7 8 9
    π2 = 9 8 4 5 6 7 1 2 3
    π3 = 1 2 3 8 7 4 5 6 9

Common interval: an interval that appears as a consecutive sequence in all the appearances, e.g. [4-8] in the example.
We denote by C_Π the set of all common intervals: C_Π = { [1-2],[2-3],[1-3],[1-9],[1-8],[4-5],[4-6],[4-7],[4-8],[4-9],[5-6] }.
A list p of common intervals is a chain if every two successive intervals in p have a non-trivial overlap, for example p = ([1-2],[2-3]).
A common interval is called reducible if there is a chain that generates it; otherwise it is called irreducible. [1-3] is reducible since it is generated by the chain of irreducible intervals [1-2],[2-3].
We denote by I_Π the set of all irreducible intervals of Π: I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }.
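The two definitions can be checked directly on the three permutations of this slide. The sketch below is a brute-force illustration, not the Heber and Stoye algorithm; it assumes the first permutation is the identity 1..n, and it tests reducibility through the equivalent condition that [i-j] is the union of two overlapping common intervals [i-b] and [a-j] with i < a ≤ b < j. The function names are assumptions.

```python
def common_intervals(perms):
    """Common intervals [lo-hi] (size >= 2) of permutations of 1..n.
    Assumes perms[0] is the identity, so value ranges are intervals."""
    n = len(perms[0])
    pos = [{v: i for i, v in enumerate(p)} for p in perms]
    found = []
    for lo in range(1, n + 1):
        for hi in range(lo + 1, n + 1):
            values = range(lo, hi + 1)
            # The values lo..hi are contiguous in a permutation when the
            # spread of their positions equals hi - lo.
            if all(
                max(ps[v] for v in values) - min(ps[v] for v in values) == hi - lo
                for ps in pos
            ):
                found.append((lo, hi))
    return found

def irreducible_intervals(intervals):
    """Drop every interval that is the union of two overlapping common intervals."""
    known = set(intervals)
    result = []
    for (i, j) in intervals:
        reducible = any(
            (i, b) in known and (a, j) in known
            for b in range(i + 1, j)
            for a in range(i + 1, b + 1)
        )
        if not reducible:
            result.append((i, j))
    return result

perms = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9],
    [9, 8, 4, 5, 6, 7, 1, 2, 3],
    [1, 2, 3, 8, 7, 4, 5, 6, 9],
]
C = common_intervals(perms)
I = irreducible_intervals(C)
```

For these three permutations the sketch finds eleven common intervals, including [4-9] (positions 1 to 6 of π2 and 4 to 9 of π3), and eight irreducible ones: [1-2], [1-8], [2-3], [4-5], [4-7], [4-8], [4-9], [5-6].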

12 Obtaining the Minimal Consensus PQ tree
Theorem: Reduce(C_Π) = Reduce(I_Π) = the minimal consensus tree.
The algorithm:
1. Compute I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }.
2. Compute Reduce(I_Π) to get the minimal consensus tree of Π.
The π pattern notation is: ((1-2-3)-(((4-5-6),7),8)-9)
Time complexity: for a π pattern of size n that appears in k places, it takes a total of O(kn + n²) to compute the maximal notation.
[Figure: the permutations π1 = 1 2 3 4 5 6 7 8 9, π2 = 9 8 4 5 6 7 1 2 3, π3 = 1 2 3 8 7 4 5 6 9 and their minimal consensus PQ tree]
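As a sanity check on the notation above, the following sketch parses a maximal-notation string into a tree and enumerates its frontiers, so one can confirm that every input permutation lies in C(T). The parser and the names `parse` and `frontiers` are assumptions made for illustration; the sketch assumes well-formed input in which each parenthesized group uses a single separator.

```python
from itertools import permutations, product

def parse(text):
    """Parse maximal notation into leaves (str) or ('P' | 'Q', children)."""
    pos = 0

    def node():
        nonlocal pos
        if text[pos] != "(":
            start = pos
            while pos < len(text) and text[pos] not in "(),-":
                pos += 1
            return text[start:pos]
        pos += 1                                  # skip '('
        children, separators = [node()], set()
        while text[pos] != ")":
            separators.add(text[pos])             # ',' or '-'
            pos += 1
            children.append(node())
        pos += 1                                  # skip ')'
        if len(separators) > 1:
            raise ValueError("mixed ',' and '-' inside one group")
        return ("Q" if separators == {"-"} else "P", children)

    return node()

def frontiers(tree):
    """All frontiers: permute P-node children, reverse Q-node children."""
    if isinstance(tree, str):
        return {(tree,)}
    kind, children = tree
    orders = permutations(children) if kind == "P" else (children, children[::-1])
    out = set()
    for order in orders:
        for parts in product(*(frontiers(c) for c in order)):
            out.add(tuple(label for part in parts for label in part))
    return out

T = parse("((1-2-3)-(((4-5-6),7),8)-9)")
C_T = frontiers(T)
inputs = ["123456789", "984567123", "123874569"]
all_covered = all(tuple(p) in C_T for p in inputs)
```

With this tree, `all_covered` is `True` and `len(C_T)` is 32, since each of the five internal nodes contributes two arrangements.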

13 Improving the Time Complexity to O(kn)
In Heber & Stoye's algorithm for obtaining I_Π, a data structure S was maintained to hold the chains of the irreducible intervals: I_Π = { [1-2],[1-8],[2-3],[4-5],[4-7],[4-8],[4-9],[5-6] }.
REPLACE(S):
Replace every chain by a Q-node.
Replace every element that is not a leaf or a Q-node and is pointed to by a vertical link with a P-node.
[Figure: the permutations π1, π2, π3, the data structure S, and the resulting PQ tree]

14 Maximal π Patterns and Sub-Trees
A sub-tree of the PQ tree T is obtained by picking a P-node in T with all its descendants, or by picking a Q-node in T with any number of consecutive descendants.
Suppose the π pattern {a,b,c,d} appears in 4 locations as Π = { abcd, acbd, dbca, dcba }.
Theorem: If p1 and p2 are π patterns, and p1 is non-maximal with respect to p2, then the PQ tree T1 that represents p1 is a sub-tree of the PQ tree T2 that represents p2.
[Figure: PQ tree (a-(b,c)-d)]

15 So what did we achieve?
A first algorithm (and optimal in time) that generates the maximal notation of a π pattern.
A "bottom-up" construction of a PQ tree.
A visualization of the inner structure of a π pattern.
Filtering of meaningful from apparently meaningless (non-maximal) clusters.
Experimental results that show this tool can aid in predicting gene functions.
Clustering for the various genome models.

16 Using Our Tool for Various Genome Models
Genome model I (orthologs only): a sequence is a permutation of the set {1,2,…,n}. There is only one maximal π pattern, {1,2,…,n}. In O(kn) time we get a PQ tree that describes all π patterns of all sizes and their non-maximal relations.

17 Using Our Tool for Various Genome Models
Genome model II: a gene may appear once in a sequence or not appear at all in that sequence. We can extend the algorithm to work on sequences that are not permutations of the same set.
Example: consider the 2 sequences 1 2 3 4 5 6 7 and 1 8 2 4 3 7 6.
Add characters as needed: 8 1 2 3 4 5 5' 6 7 8' and 5 1 8 8' 2 4 3 7 6 5'.
Build the PQ tree on the new sequences. The sub-trees that have no red leaves are all the maximal π patterns.
[Figure: PQ tree built on the two extended sequences, with some leaves drawn in red]
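The padding step in this example can be reproduced with a short sketch. The rule used here is inferred from the slide's single example and is an assumption, not a statement of the authors' procedure: a gene g that is missing from some sequence is written as g followed by g' wherever it occurs, and in a sequence that lacks it, g is prepended and g' is appended. The function name `pad_to_common_set` is hypothetical.

```python
def pad_to_common_set(sequences):
    """Extend sequences so that all become permutations of the same set.
    Assumed rule (inferred from the slide's example, not confirmed):
      - g present in a sequence but absent elsewhere -> replaced by g, g'
      - g absent from a sequence -> g added at the front, g' at the end
    """
    all_genes = set().union(*sequences)
    partial = sorted(g for g in all_genes if any(g not in s for s in sequences))
    padded = []
    for s in sequences:
        body = []
        for g in s:
            body.append(g)
            if g in partial:
                body.append(g + "'")       # primed copy placed right after g
        absent = [g for g in partial if g not in s]
        padded.append(absent + body + [g + "'" for g in absent])
    return padded

seqs = ["1 2 3 4 5 6 7".split(), "1 8 2 4 3 7 6".split()]
extended = [" ".join(s) for s in pad_to_common_set(seqs)]
```

On the slide's input, `extended` is `["8 1 2 3 4 5 5' 6 7 8'", "5 1 8 8' 2 4 3 7 6 5'"]`, the same two sequences shown above. When several genes are missing from one sequence, the order in which they are added is a further assumption of this sketch.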

18 Using Our Tool for Various Genome Models
Genome model III (paralogs and orthologs): a gene may appear any number of times in a sequence (including zero). The minimal consensus PQ tree is not necessarily unique.
Example: consider 2 appearances of the π pattern {a,a,b} as Π = { aab, baa }. The two copies of a can be labeled in two ways:
1. Π = { a1 a2 b, b a2 a1 }, giving C(T) = { a1 a2 b, b a2 a1 }.
2. Π = { a1 a2 b, b a1 a2 }, giving C(T) = { a1 a2 b, b a2 a1, a2 a1 b, b a1 a2 }.
[Figure: the two PQ trees over the leaves a1, a2, b]


