Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU),


1 Locating conserved genes in whole genome scale Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)

2 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

3 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

4 Mouse & Human  Do they look the same? Mouse and human are genetically very similar. What do we mean by similar? Many genes found in human are also found in mouse: these are conserved genes. [Figure labels: Mouse Chromosome 16, Human Chromosome 16, m16, h03]

5 Whole Genome Alignment  Identify regions on the genomes that possibly contain their conserved genes. Differences in the ordering of conserved genes could be related to mutations. For related species, the number of mutations is usually small. [Figure: Genome A and Genome B with Gene X, Gene Y, Gene Z; one out-of-order gene labelled "possibly a mutation"]

6 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

7 Data size  Usually very large (e.g., human chromosomes vs mouse chromosomes)
Examples:
Human Chr No.   Length    Mouse Chr No.   Length
1               245M      1               134M
3               200M      2               181M
[lost]          [lost]M   7               134M
[lost]          [lost]M   8               129M
20              64M       16              99M
Cannot use global alignment tools because of the large size

8 Observations  A conserved gene may not be identical in the two genomes; nevertheless, it contains some common substrings that are unique to it (called MUMs, maximal unique matches)  Locate all MUMs over the two genomes, keeping in mind that not every MUM corresponds to a conserved gene [Figure labels: Gene X, Gene Y, Gene X, Noise]
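The MUM definition above can be illustrated with a brute-force sketch. This is not the suffix-tree construction the slides mention, only a quadratic-time illustration for tiny strings; the function names and the `min_len` threshold are assumptions made for this example.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal match that is unique in both a and b.

    Brute force for illustration only; min_len is a hypothetical noise filter.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip starts that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two strings agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            match = a[i:i + length]
            if (length >= min_len
                    and count_occurrences(a, match) == 1
                    and count_occurrences(b, match) == 1):
                mums.append((i, j, length))
    return mums


print(find_mums("GGACGTCC", "TTACGTAA"))
```

In the first toy pair the only shared substring of length at least 3 is ACGT, which occurs once in each string. A match that repeats in either genome, such as ACGT in "ACGTACGT", is rejected because it is not unique.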

9 Number of MUMs  [Table: Mouse Chr No., Human Chr No., # of MUMs; values lost in extraction]  The number of MUMs is small compared with the chromosome length

10 MUMs for M16-H03 Conserved genes Human Chromosome 03 Mouse Chromosome 16

11 Generation of MUM using suffix tree How to choose the right MUMs?

12 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

13 MUM Selection  MUMmer-1 [Delcher et al. Nucleic Acids Research 1999]  longest common subsequences (effectively assume no mutations)  MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004]  clustering heuristics  most popular tool to uncover conserved genes in WG scale  MaxMinCluster [Wong et al. Bioinformatics 2004*]  clustering, optimization  MSS Mutation Sensitive Selection [Chan et al. Bioinformatics 2005*]  capture mutations  Hybrid approach [Chan et al. Bioinformatics 2005*]  combine mutation sensitive and clustering approaches * our results

14 Overview of Results  Average coverage (sensitivity), in %
Method               Mouse/Human   Intragenus Baculoviridae   Intergenus Baculoviridae
MUMmer-3             77 (27)       66 (71)                    43 (62)
MaxMinCluster        84 (29)       69 (75)                    45 (59)
MSS                  91 (29)       79 (75)                    36 (53)
MUMmer-3+MSS         91 (28)       79 (75)                    48 (43)
MaxMinCluster+MSS    91 (27)       79 (82)                    51 (53)
 coverage: % of published conserved genes reported  sensitivity: % of MUMs reported that reside in published conserved genes
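The two metrics defined on this slide can be computed as in the sketch below. The slides do not say when a gene counts as "reported" or when a MUM "resides in" a gene; here a gene is assumed to be reported if at least one selected MUM lies entirely inside it, and intervals are assumed half-open. The function name and data layout are hypothetical.

```python
def coverage_and_sensitivity(genes, mums):
    """Return (coverage %, sensitivity %) on one genome.

    genes: list of (start, end) half-open intervals of published conserved genes.
    mums:  list of (start, length) for the MUMs a method reported.
    Assumption: a MUM "resides in" a gene only if it is fully contained in it.
    """
    def inside(mum, gene):
        start, length = mum
        return gene[0] <= start and start + length <= gene[1]

    genes_hit = sum(1 for g in genes if any(inside(m, g) for m in mums))
    mums_in_genes = sum(1 for m in mums if any(inside(m, g) for g in genes))
    coverage = 100.0 * genes_hit / len(genes) if genes else 0.0
    sensitivity = 100.0 * mums_in_genes / len(mums) if mums else 0.0
    return coverage, sensitivity


genes = [(0, 100), (200, 300), (400, 500), (600, 700)]
mums = [(10, 20), (50, 30), (150, 20), (210, 40)]
print(coverage_and_sensitivity(genes, mums))
```

In this toy input two of the four genes contain a reported MUM, and three of the four reported MUMs fall inside a gene.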

15 Overview of Results  (same table as slide 14)  MSS outperforms MaxMinCluster and MUMmer-3 on closely related species

16 Overview of Results  (same table as slide 14)  BUT MSS performs worse on species relatively farther apart

17 Overview of Results  (same table as slide 14)  Both hybrid approaches perform well for species farther apart

18 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

19 Longest Common Subsequence LCS
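A minimal sketch of the LCS idea applied to MUMs: keep the heaviest set of MUMs that appears in the same order in both genomes. This is a simple quadratic dynamic program written for illustration, not MUMmer-1's implementation; the tuple layout `(pos_a, pos_b, length)` and the use of length as weight are assumptions.

```python
def heaviest_chain(mums):
    """Pick the MUMs of maximum total length that are increasing in both genomes.

    mums: list of (pos_a, pos_b, length). Overlaps between MUMs are ignored
    in this sketch.
    """
    order = sorted(mums)  # sort by position in genome A
    n = len(order)
    if n == 0:
        return []
    best = [m[2] for m in order]   # best[i]: heaviest chain ending at MUM i
    prev = [-1] * n                # back-pointers to rebuild the chain
    for i in range(n):
        for j in range(i):
            same_order = order[j][0] < order[i][0] and order[j][1] < order[i][1]
            if same_order and best[j] + order[i][2] > best[i]:
                best[i] = best[j] + order[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(order[end])
        end = prev[end]
    return chain[::-1]


mums = [(0, 0, 10), (20, 50, 5), (30, 20, 8), (60, 40, 12), (80, 90, 6)]
print(heaviest_chain(mums))
```

The MUM at `(20, 50, 5)` is dropped because it is out of order relative to the others. If it belonged to a transposed gene, that gene would be missed, which is the sense in which this approach effectively assumes no mutations.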

20 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks  LCS Approach (MUMmer-1) does not take mutations into account  MUMmer-2 & -3 cluster by heuristics  MaxMinCluster formalizes clustering as a combinatorial optimization problem

21 Clustering approach  Observations  Noise MUMs are usually short and isolated  A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters Gene X Gene Y Gene X Noise
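The observation above suggests a simple gap-based grouping, sketched here under stated assumptions. It is neither the MUMmer heuristic nor the MaxMinCluster dynamic program; the parameters `gap` and `min_size` borrow the names Gap and MinSize used on the noisy cluster slides, and the closeness test is a hypothetical simplification.

```python
def cluster_mums(mums, gap=100, min_size=40):
    """Group consecutive MUMs that are close in both genomes.

    mums: list of (pos_a, pos_b, length). A MUM joins the current cluster if it
    starts at most `gap` after the previous MUM ends, in both genomes.
    Clusters whose total MUM length is below `min_size` are treated as noise.
    """
    clusters, current = [], []
    for mum in sorted(mums):
        if current:
            prev_a, prev_b, prev_len = current[-1]
            close = (0 <= mum[0] - (prev_a + prev_len) <= gap
                     and 0 <= mum[1] - (prev_b + prev_len) <= gap)
            if not close:
                clusters.append(current)
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]


mums = [(0, 0, 20), (50, 60, 25), (500, 900, 15), (1000, 1000, 30), (1050, 1040, 20)]
print(cluster_mums(mums))
```

The short, isolated MUM at `(500, 900, 15)` forms a cluster of total length 15 and is discarded, while the two groups of nearby MUMs survive.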

22 Challenge  Challenge: some conserved genes do not induce clusters of sufficient length Solution: relax the definition of clusters to allow the presence of noise

23 Noisy cluster  Suppose Gap = 100, MinSize = 40. [Figure: a 1-noisy cluster; labels: "> 100 apart", "length = 20"]

24 Noisy cluster  Suppose Gap = 100, MinSize = 40. [Figure: a 2-noisy cluster; labels: "> 100 apart", "length = 20"]

25 MaxMinCluster  Problem formulation  find a collection of k-noisy clusters such that the smallest cluster has the maximum weight  Dynamic programming: $O(k^2 n^2)$ time, $O(k^2 n)$ space

26 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks Capture mutations more directly

27 Mutation Sensitive Selection  select subsets of MUMs whose order in one genome can be transformed into their order in the other by a few mutations  three types of mutations: reversal, transposition, reversed-transposition

28 k-mutated subsequences  Given two sequences A & B and an integer k,  a pair consisting of a subsequence X of A and a subsequence Y of B is called a pair of k-mutated subsequences if X can be transformed into Y by at most k mutations  [Figure: a pair of 2-mutated subsequences, related by one reversal and one transposition]  MUMs are signed; a reversal flips the sign of the MUMs it covers

29 Mutation Sensitive Selection  Problem formulation:  To find a pair of k-mutated subsequences with maximum weight  We believe that the problem is NP-hard  The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem  We give an efficient approximation algorithm  the resulting weight is close to (at least 1/(3k+1) times) the maximum possible weight  $O(n^2 \log n + kn^2)$ time, $O(n^2)$ space
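The approximation guarantee stated on this slide can be written as an inequality. The symbols $W_{\mathrm{MSS}}$ (weight of the pair returned by the algorithm) and $W_{\mathrm{OPT}}$ (maximum possible weight) are notation assumed here; they do not appear in the slides.

```latex
W_{\mathrm{MSS}} \;\ge\; \frac{1}{3k+1}\, W_{\mathrm{OPT}},
\qquad \text{e.g. } k = 2 \;\Rightarrow\; W_{\mathrm{MSS}} \ge \tfrac{1}{7}\, W_{\mathrm{OPT}}.
```

The bound weakens as the allowed number of mutations k grows, which fits the earlier remark that the number of mutations between related species is usually small.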

30 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

31 Hybrid Approach  first apply clustering approach to identify clusters which are obviously conserved genes  can apply either MUMmer-3 or MaxMinCluster  these clusters are treated as MUM with bigger weight  then apply MSS to process these MUM together with the remaining MUM
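The step "these clusters are treated as MUM with bigger weight" might look like the sketch below. The function name and the choice to place the merged MUM at the cluster's leftmost coordinates are assumptions; the resulting list would then be handed to MSS, which is not implemented here.

```python
def collapse_clusters(clusters, remaining):
    """Replace each cluster by one heavier pseudo-MUM and merge it with the rest.

    clusters:  list of clusters, each a list of (pos_a, pos_b, length) MUMs.
    remaining: MUMs that did not fall into any cluster.
    Returns one sorted list of (pos_a, pos_b, weight) candidates for the
    mutation-sensitive selection step.
    """
    merged = []
    for cluster in clusters:
        start_a = min(m[0] for m in cluster)
        start_b = min(m[1] for m in cluster)
        weight = sum(m[2] for m in cluster)  # a cluster weighs as much as its MUMs combined
        merged.append((start_a, start_b, weight))
    return sorted(merged + list(remaining))


clusters = [[(0, 0, 20), (50, 60, 25)]]
remaining = [(500, 900, 15)]
print(collapse_clusters(clusters, remaining))
```

Giving a cluster the combined weight of its members makes it harder for the later selection step to discard a region that clustering already marked as conserved.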

32 Outline  Motivation  Challenges of Whole Genome Alignment  Four approaches and their performance  Longest Common Subsequence  Clustering Approach  Mutation Sensitive Selection  Hybrid Approach  Remarks

33 Remarks  Experiments show that  MaxMinCluster > LCS  MSS > MaxMinCluster for closely related species  MSS does not perform well for species relatively farther apart  Hybrid approach is the best for both closely related and farther apart species

34 Thank you! Q & A

35 Approximation Algorithm  Super-Backbone  maximum weight common subsequences  Identify k mutation blocks  having high weight  do not overlap with Super-Backbone too much  this is formulated as a sub-problem and solved optimally by dynamic programming  Report Super-Backbone & k mutation blocks  $O(n^2 \log n + kn^2)$ time, $O(n^2)$ space

36 Mutations  three types of mutations: reversal, transposition, reversed-transposition
original:                a b c d e f g h i j k l m n o p q r s t u v w x y z
reversal:                a d c b e f g h i j k l m n o p q r s t u v w x y z
transposition:           a d c b e k l m n o p q r s t u v w x y f g h i j z
reversed-transposition:  a d c b e k l t s r q p o m n u v w x y f g h i j z
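The three mutation types can be sketched as list operations that reproduce the example on this slide. Letters are encoded as signed integers (a = 1, ..., z = 26) so that a reversal can flip signs, as the slides state for signed MUMs. The function names and index conventions are assumptions made for this illustration.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] and flip the sign of each element in it."""
    return seq[:i] + [-x for x in reversed(seq[i:j])] + seq[j:]


def transposition(seq, i, j, dest):
    """Cut the block seq[i:j] and re-insert it at index dest of the remaining sequence."""
    block, rest = seq[i:j], seq[:i] + seq[j:]
    return rest[:dest] + block + rest[dest:]


def reversed_transposition(seq, i, j, dest):
    """Like transposition, but the moved block is also reversed and sign-flipped."""
    block, rest = [-x for x in reversed(seq[i:j])], seq[:i] + seq[j:]
    return rest[:dest] + block + rest[dest:]


def to_letters(seq):
    """Render signed integers as letters, ignoring the sign."""
    return " ".join(chr(ord("a") + abs(x) - 1) for x in seq)


start = list(range(1, 27))                     # a .. z
s1 = reversal(start, 1, 4)                     # reverse "b c d"
s2 = transposition(s1, 5, 10, 20)              # move "f g h i j" after "y"
s3 = reversed_transposition(s2, 9, 15, 7)      # move "o .. t" before "m", reversed
for s in (start, s1, s2, s3):
    print(to_letters(s))
```

Printing with `to_letters` hides the signs. Inspecting `s1` directly shows the reversed block as `-4, -3, -2`.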

