Presentation is loading. Please wait.

Presentation is loading. Please wait.

Genome Rearrangement Phylogeny

Similar presentations


Presentation on theme: "Genome Rearrangement Phylogeny"— Presentation transcript:

1 Genome Rearrangement Phylogeny
Robert K. Jansen School of Biology University of Texas at Austin Bernard M.E. Moret Department of Computer Science University of New Mexico Li-San Wang Tandy Warnow Department of Computer Sciences

2 Outline Introduction Genome rearrangement phylogeny reconstruction
Application Other methods Future research

3 New Phylogenetic Signals
Large-throughput sequencing efforts lead to larger datasets Challenge: inferring deep evolutionary events Biologists turning to “rare genomic changes” Rare Large state space High signal-to-noise ratio Potential for clarifying early evolution Best studied: gene order evolution (genome rearrangement)

4 Genomes As Signed Permutations
1 – or 5 – etc.

5 Gene Order Data Rare changes on the genomic scale Large state space
DNA: 4 states/character Protein (amino acid sequence): states/character Circular gene order with 120 genes: High signal-to-noise ratio states/character

6 Genomes Evolve by Rearrangements
Inversion: 1 2 –6 – Transposition: Inverted Transposition:

7 Edit Distances Between Genomes
(INV) Inversion distance [Hannenhalli & Pevzner 1995] Computable in linear time [Moret et al 2001] (BP) Breakpoint distance [Watterson et al. 1982] Computable in linear time NJ(BP): [Blanchette, Kunisawa, Sankoff, 1999] A = B = BP(A,B)=3

8 Our Model: the Generalized Nadeau-Taylor Model [STOC’01]
Three types of events: Inversions (INV) Transpositions (TRP) Inverted Transpositions (ITP) Events of the same type are equiprobable Probabilities of the three types have fixed ratio We focus on signed circular genomes in this talk.

9 Simulation Study Protocol
Synthetic Input Evolutionary Process A B C D E F True Tree A B C D E F Known in simulation Phylogenetic Method D B A C E F Inferred Tree

10 Quantifying Error FN: false negative (missing edge)
1/3=33.3% error rate

11 Outline Genome rearrangement evolution
Genome rearrangement phylogeny reconstruction Application Other methods Future research

12 Gene Order Parsimony

13 Breakpoint Phylogeny [Sankoff & Blanchette 1998]
“Maximum Parsimony”-style problem: Find tree(s), leaf-labeled by genomes, with shortest breakpoint length NP-hard problem on two levels: Find the shortest tree (the space of trees has exponential size) Given a tree, find its breakpoint length (Even for a tree with 3 leaves, but can be reduced to TSP) BPAnalysis [Sankoff & Blanchette 1998] Takes 200 years to compute our 13-taxon dataset on a Sun workstation

14 BPAnalysis A B Y X’ C Z Y’ X’ Tree length evaluation for EVERY tree
Given a fixed tree topology, evaluate the tree length: Iteratively evaluate the median problem (tree length for a 3-leaf tree) A B Y X’ C Z Y’ X’

15 GRAPPA (Genome Rearrangement Analysis under Parsimony and other Phylogenetic Algorithms)
Uses lowerbound techniques to speed up Used on real datasets, producing thousand-fold speedups over BPAnalysis [ISMB’01] Contributors: (led by Bernard Moret at UNM) U. New Mexico U. Texas at Austin Universitá di Bologna, Italy

16 The Circular Lowerbound of the Length of a Tree
Given a tree, we can lowerbound its length very quickly: 1 2 3 4

17 The Lowerbound Technique
Avoid any tree X without potential: tree X whose lowerbound lb(X) is higher than twice the length c(T) of the best tree T Finding a good starting tree quickly is of utmost importance We turn to distance-based methods Neighbor joining (NJ) [Saitou and Nei 1987] Weighbor [Bruno et al. 2000]

18 Additive Distance Matrix and True Evolutionary Distance (T.E.D.)
7 S4 5 5 S 3 1 S 4 S 8 S2 S5 Theorem [Waterman et al. 1977] Given an m×m additive distance matrix, we can reconstruct a tree realizing the distance in O(m2) time.

19 Error Tolerance of Neighbor Joining
Theorem [Atteson 1999] Let {Dij} be the true evolutionary distances, and {dij} be the estimated distances for T. Let be the length of the shortest edge in T. If for all taxa i,j, we have then neighbor joining returns T.

20 BP and INV BP/2 vs K (120 genes) INV vs K
(K: Actual number of inversions) (Inversion-only evolution)

21 NJ(BP) [Blanchette, Kunisawa, Sankoff 1999] and NJ(INV)
Inversion only Transpositions/ inverted transpositions only 120 genes, 160 leaves Uniformly Random Tree

22 Estimate True Evolutionary Distances Using BP
To use the scatter plot to estimate the actual number of events (K): Compute BP/2 From the curve, look up the corresponding value of K (2) (1) BP/2 vs K (120 genes) (K: Actual number of inversions) (Inversion-only evolution)

23 True Evolutionary Distance (t.e.d.) Estimators for Gene Order Data
Exact-IEBP [WABI’01] Approx-IEBP [STOC’01] EDE [ISMB’01] Based on the Expectation of Breakpoint distance (Exact) Breakpoint distance (Approx.) Inversion distance (Approx.) Derivation Analytical Empirical Model knowledge Required Inversion-only IEBP: Inverting the Expected BreakPoint distance EDE: Empirically Derived Estimator

24 True Evolutionary Distance Estimators
BP vs K (120 genes) Exact-IEBP vs K (K: Actual number of inversions) (Inversion-only evolution)

25 Variance of True Evolutionary Distance Estimators
There are new distance-based phylogeny reconstruction methods (though designed for DNA sequences) Weighbor [Bruno et al. 2000] These methods use the variance of good t.e.d.’s, and yield more accurate trees than NJ. Variance estimates for the t.e.d.s [Wang WABI’02] Weighbor(IEBP), Weighbor(EDE) K vs Exact-IEBP (120 genes)

26 Using T.E.D. Helps 5% 120 genes 160 leaves
Uniformly random tree Transpositions/inverted transpositions only (180 runs per figure) 5%

27 Observations EDE is the best distance estimator when used with NJ and Weighbor. True evolutionary distance estimators are reliable even when we do not know the GNT model parameters (the probability ratios of the three types of events).

28 Outline Genome rearrangement evolution
Genome rearrangement phylogeny reconstruction Application Other methods Future research

29 Percentage of Trees Eliminated Through Bounding [ISMB’01]
edge length=2 # taxa 10 20 40 80 160 1% 80% 91% 100% 99% 320 #genes Uses NJ(EDE) as starting tree

30 Campanulaceae cpDNA BPAnalysis takes 2 centuries on a Sun workstation
13 taxa (tobacco as outlier) 105 gene segments GRAPPA finds 216 trees with shortest breakpoint length (out of 654,729,075 trees) Running Time: BPAnalysis takes 2 centuries on a Sun workstation GRAPPA takes 1.5 hours on a 512-node supercluster About 2300-fold speedup on a single node

31 Campanulaceae [Moret et al. ISMB 2001]
Strict consensus of 216 optimal trees found by GRAPPA 6 out of 10 max. edges found Tobacco Platycodon Cyananthus Codonopsis Merciera Wahlenbergia Triodanis Asyneuma Legousia Symphandra Adenophora Campanula Trachelium

32 Outline Genome rearrangement evolution
Genome rearrangement phylogeny reconstruction Application Other methods Future research

33 “Fast” Approaches for Genome Rearrangement Phylogeny
Basic technique: encode data as strings and apply maximum parsimony Running time exponential in the number of genomes, but polynomial in the number of genes (faster than GRAPPA) MPBE [ISMB’00] Maximum Parsimony using Binary Encodings MPME [Boore et al. Nature ’95, PSB’02] Maximum Parsimony using Multi-state Encodings The length of a tree using these two methods is a lowerbound of the true breakpoint length [Bryant ’01]

34 Maximum Parsimony using Binary Encoding (MPBE)
Input genome (circular) A: = -4 –3 –2 –1 B: –2 = C: –4 = –2 -1 MPBE Strings (1,2) (2,3) (3,4) (4,1) (1,-4) (-2,1) (2,-3) (-3,-4) (-4,1) A: B: C:

35 Maximum Parsimony using Multistate Encoding (MPME)
Input genome (circular) A: = -4 –3 –2 –1 B: –2 = C: –4 = –2 -1 MPME Strings We use PAUP to solve Maximum Parsimony => Constraint: number of states per site cannot exceed 32 –2 –3 -4 A: –4 –1 –2 -3 B: – –2 -3 C: 2 – –

36 NJ vs MP (120 genes, 160 genomes)
All three event types equiprobable (datasets that exceed 32-state limit for MPME are dropped)

37 Inversion Phylogeny Inversion median has higher running time than breakpoint median Inversion phylogeny overall has shorter running time than breakpoint phylogeny, and returns more accurate trees [Moret et al. WABI ’02]

38 DCM-GRAPPA [Moret & Tang 2003]
Disk-Covering Method: divide the original problem into subproblems [Huson, Nettles, Parida, Warnow and Yooseph, 1998] Uses inversion distance DCM-GRAPPA: can now process thousands of genomes, each having hundreds of genes

39 Ongoing and Future Research
Genome rearrangement phylogeny with unequal gene content (duplications, deletions, etc.) Non-uniform genome rearrangement models (Segment-length dependent model, hotspots)

40 Acknowledgements University of Texas Tandy Warnow (Advisor) Robert K. Jansen Stacia Wyman Luay Nakhleh Usman Roshan Cara Stockham Jerry Sun University of New Mexico Bernard M.E. Moret David Bader Jijun Tang Mi Yan Central Washington University Linda Raubeson

41 Phylolab Department of Computer Sciences University of Texas at Austin
Please visit us at


Download ppt "Genome Rearrangement Phylogeny"

Similar presentations


Ads by Google