Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress Yufeng Wu Dept. of Computer Science and Engineering.


1 Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress
Yufeng Wu, Dept. of Computer Science and Engineering, University of Connecticut, USA
Dagstuhl Seminar, 2010

2 Recombination
One of the principal genetic forces shaping sequence variation within species.
Two equal-length sequences generate a third, new sequence of the same length in the genealogy.
Spatial order is important: different parts of the genome are inherited from different ancestors.
[Figure: the sequences 110001111111001 and 000110000001111 recombine at a breakpoint; the recombinant joins a prefix of the first to a suffix of the second: 1100 00000001111.]
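The prefix/suffix picture above can be sketched in a few lines of Python. This is only an illustration: the function name `recombine` and the cut position 5 are assumptions, chosen so that the result reproduces the recombinant shown on the slide.

```python
def recombine(left_parent: str, right_parent: str, cut: int) -> str:
    """Single-crossover recombination: prefix of one parent, suffix of the other.

    `cut` is the number of sites taken from `left_parent` (an illustrative
    parameter, not something named on the slide).
    """
    if len(left_parent) != len(right_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(left_parent):
        raise ValueError("cut must lie within the sequence")
    # Positions are preserved: site i of the child comes from site i of a parent.
    return left_parent[:cut] + right_parent[cut:]


child = recombine("110001111111001", "000110000001111", 5)
print(child)
```

Because the child keeps each site at its original position, it has the same length as both parents, which is the "equal length" property stated on the slide.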

3 Ancestral Recombination Graph (ARG)
Network model: goes beyond the tree model.
Assumption: at most one mutation per site.
[Figure: example genealogies over two sites, one with leaves S1 = 00, S2 = 01, S3 = 10, S4 = 10 and one with leaves S1 = 00, S2 = 01, S3 = 10, S4 = 11; labels: Mutations, Recombination.]

4 Reconstruction of Network-based Evolutionary History
Input: DNA sequences (haplotypes) or phylogenetic trees.
Biology: meiotic recombination in populations, or reticulate evolutionary processes such as horizontal gene transfer or hybrid speciation.
Different formulations, same objective: reconstruct the network-based evolutionary history (and solve related problems).
Efficiency and accuracy.

5 Reconstructing ARGs by Parsimony
Input: a set of binary sequences M.
Goal: reconstruct ARGs deriving M.
Parsimony formulation
–minARG: minimize the number of recombination events
–NP-complete (Wang, et al.)
[Figure: Kreitman’s data for the Adh locus of D. melanogaster (1983).]

6 The minARG Problem
Uniform sampling of minARGs, treating each minARG as equally likely (Wu).
Estimating the range of minARGs: lower and upper bounds.
Structurally constrained ARGs with simplified topology, e.g. galled trees (Wang, et al.; Gusfield, et al.).
Heuristic methods, e.g. the program MARGARITA (Durbin, et al.), Song, et al., Parida, et al.
Exact minARG by branch and bound (Lyngso, Song and Hein).

7 minARG for Kreitman’s Data
Challenge: accurate inference of ARGs.
Rmin(M): minimum number of recombinations for M.
L(M): lower bound on Rmin(M).
U(M): upper bound on Rmin(M).
Several lower bounds give L(M) = 7, and U(M) = 7 for Kreitman’s data (Song, Wu and Gusfield). Thus Rmin(M) = 7.
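To make the idea of a lower bound L(M) concrete, here is a minimal sketch of one classical bound, the Hudson-Kaplan bound. The slide does not say which bounds were used, so this is a swapped-in example and not necessarily one of them. It counts the largest number of non-overlapping site intervals whose endpoints fail the four-gamete test. The function names and the toy input are illustrative assumptions.

```python
def incompatible(matrix, p, q):
    """Four-gamete test: sites p and q show all of 00, 01, 10, 11."""
    return len({(row[p], row[q]) for row in matrix}) == 4


def hk_lower_bound(matrix):
    """Hudson-Kaplan lower bound on the number of recombination events.

    Every incompatible pair (p, q) forces a recombination somewhere between
    sites p and q; pairwise disjoint intervals force distinct events.
    """
    n_sites = len(matrix[0])
    intervals = [
        (p, q)
        for p in range(n_sites)
        for q in range(p + 1, n_sites)
        if incompatible(matrix, p, q)
    ]
    # Greedy interval scheduling: always take the interval that ends first.
    intervals.sort(key=lambda interval: interval[1])
    count, last_end = 0, 0
    for p, q in intervals:
        if p >= last_end:  # intervals may share an endpoint site
            count += 1
            last_end = q
    return count


# Toy data (hypothetical, not Kreitman's data): rows are sequences.
toy = ["0000", "0101", "1010", "1111"]
print(hk_lower_bound(toy))
```

Stronger bounds exist; this one is shown only because it fits in a few lines.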

8 ARG Induces Local Trees
Local trees: the evolutionary history at a single genomic position.
To read off a local tree, trace backwards in time; at each recombination node, follow the branch that passed alleles to the recombinant at this position.
[Figure: an ARG with mutation and recombination events for the data 0000, 0101, 0110, 1110, 1010, with the local tree near site 3.]

9 Local Trees Change Across the Genome
Local trees change when moving across recombination breakpoints.
Spatial property: nearby local trees tend to be more similar.
How good are the inferred ARGs? Compare the inferred local tree topologies with the simulated trees.
[Figure: an ARG for the data 0000, 0101, 0110, 1110, 1010, with the local tree near site 2.]

10 Inferring Local Trees
Problem: given binary sequences, infer local tree topologies (one tree for each site, ignoring branch lengths).
Parsimony-based approaches: Hein (1990, 1993), Song and Hein (2005).
Wu (2010): exploits shared topological features in nearby trees.
Key: local trees have different topologies due to recombination.
Trees or network? Do not reconstruct the full network; local trees are very informative.
Challenge: how to improve the accuracy?
Accuracy measure: Robinson-Foulds distance between each inferred tree and the simulated tree.
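The accuracy measure can be illustrated with a small sketch of a Robinson-Foulds style distance. It assumes rooted trees written as nested tuples and compares their clades (the rooted variant of the split-based distance); the representation and function names are illustrative choices, not the author's code.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a rooted tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), set()  # a leaf contributes no internal clade
    leaves, found = frozenset(), set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade rooted at this internal node
    return leaves, found


def rf_distance(tree_a, tree_b):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(tree_a)[1] ^ clades(tree_b)[1])


inferred = ((1, 2), (3, 4))
simulated = (((1, 2), 3), 4)
print(rf_distance(inferred, simulated))
```

A distance of 0 means the two topologies agree; a non-binary inferred tree simply contributes fewer clades.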

11 RENT: REfining Neighboring Trees
Maintain for each SNP site a (possibly non-binary) tree topology
–Initialize to a tree containing the split induced by the SNP
Gradually refine the trees by adding new splits
–Splits are found by a set of rules (later)
–Splits added early may be more reliable
Stop when the trees are binary or enough information is recovered

12 A Little Background: Compatibility
Two sites (columns) p and q are incompatible if columns p and q contain all four ordered pairs (gametes): 00, 01, 10, 11. Otherwise, p and q are compatible.
Easily extended to splits.
Example matrix M (rows a to e, sites A to C):
    A B C
a   0 0 0
b   1 0 0
c   0 0 1
d   1 0 1
e   0 1 1
Sites A and B are compatible, but A and C are incompatible.
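The compatibility test is short enough to write out. The sketch below assumes the 15 matrix entries on the slide are read row by row (five sequences a to e, three sites A to C); that reading is an assumption about the slide layout, but it agrees with the stated result for A, B and C.

```python
from itertools import combinations


def gametes(matrix, p, q):
    """Distinct ordered pairs (gametes) seen at sites p and q."""
    return {(row[p], row[q]) for row in matrix}


def compatible(matrix, p, q):
    """Sites are compatible unless all four gametes 00, 01, 10, 11 appear."""
    return len(gametes(matrix, p, q)) < 4


# Rows a..e, columns A, B, C (assumed row-major reading of the slide).
M = ["000", "100", "001", "101", "011"]
SITES = "ABC"

incompatible_pairs = [
    (SITES[p], SITES[q])
    for p, q in combinations(range(len(SITES)), 2)
    if not compatible(M, p, q)
]
print(incompatible_pairs)
```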

13 Fully-Compatible Region: Simple Case
A region of consecutive SNP sites where these SNPs are pairwise compatible.
–May indicate that no topology-altering recombination occurred within the region
Rule: for site s, add the splits of the sites in such a region around s to the tree at s.
–Compatibility is a very strong property and is unlikely to arise by chance.
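One way to locate such regions is a left-to-right scan that extends each region while every pair of sites in it stays compatible. The sketch below is a straightforward illustration under that assumption and is not taken from the RENT implementation; the function names and the small input matrix are hypothetical.

```python
def incompatible(matrix, p, q):
    """Four-gamete test on sites p and q."""
    return len({(row[p], row[q]) for row in matrix}) == 4


def compatible_regions(matrix):
    """Maximal runs (start, end) of consecutive, pairwise compatible sites."""
    n_sites = len(matrix[0])
    regions, previous_end = [], -1
    for start in range(n_sites):
        end = start
        # Extend while the next site is compatible with every site in the run.
        while end + 1 < n_sites and not any(
            incompatible(matrix, p, end + 1) for p in range(start, end + 1)
        ):
            end += 1
        # The reachable end never decreases with start, so a run is maximal
        # exactly when it ends later than the previous one.
        if end > previous_end:
            regions.append((start, end))
            previous_end = end
    return regions


# Hypothetical input: sites 0 and 2 are incompatible, site 1 fits with both.
example = ["000", "100", "001", "101", "011"]
print(compatible_regions(example))
```

A site can belong to more than one maximal region, as site 1 does here.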

14 Split Propagation: More General Rule
Three consecutive sites A, B and C. Sites A and B are incompatible. Does site C matter for the tree at site A?
–The trees at sites A and B are different.
–Suppose site C is compatible with both A and B. Then?
–Site C may indicate a subtree shared by the trees at sites A and B.
Rule: a split propagates in both directions until it reaches an incompatible tree.

15 Reticulate Networks
Gene trees: phylogenetic trees from gene sequences
- Assume: binary and rooted
- Different topologies at different genes
Reticulate evolution: one explanation
- Hybrid speciation, horizontal gene transfer
Reticulate network: a directed acyclic graph displaying each of the gene trees.
Hybridization event: a node with in-degree two or more.
Example data (taxa 1 to 4, three sites per gene):
Gene A  1: 0 0 0   2: 0 0 1   3: 1 1 0   4: 1 0 0
Gene B  1: 0 0 0   2: 1 0 1   3: 0 1 0   4: 0 0 1
[Figure: gene trees T and T' on taxa 1 to 4 with root ρ, and a network displaying both; labels: keep two red edges, keep two black edges.]

16 The Minimum Reticulation Problem
Given: a set G of K gene trees.
Problem: reconstruct a reticulate network that displays each gene tree using Rmin(G), the minimum number of reticulation events.
NP-complete, even for K = 2.
Current approaches:
- exact methods for the K = 2 case (see Semple, et al.)
- impose topological constraints (e.g. galled networks, see Huson, et al.)
Challenge: efficient and accurate reconstruction of reticulate networks for multiple trees.
Close lower and upper bounds for an arbitrary number of trees (Wu, 2010).
[Figure: gene trees T1, T2, T3 on taxa 1 to 4 and a network N.]

17 Performance of PIRN: Optimal Solution
Lower and upper bounds match for many datasets.
[Figure: horizontal axis: number of taxa; vertical axis: % of datasets with LB = UB; K: number of trees; r: level of reticulation.]

18 Performance of PIRN: Gap of Bounds
The gap between the lower and upper bounds is small for many datasets.
[Figure: horizontal axis: number of taxa; vertical axis: gap between lower and upper bounds; K: number of trees; r: level of reticulation.]

19 Reticulate Network for Five Poaceae Trees
Gene trees: rpoC2, phyB, rbcL, ndhF, ITS.
Lower bound: 11. Upper bound: 13.

20 Reticulate Network for Five Poaceae Trees
[Figure: the network shown uses 13 reticulation events, the upper bound.]

21 Acknowledgement
More information available at: http://www.engr.uconn.edu/~ywu
Research supported by the National Science Foundation and the UConn Research Foundation.

22 Coalescent with Recombination
Coalescent theory: defines a probability distribution over genealogies.
Likelihood computation for the coalescent with recombination
–Probability of each ARG under given parameters
–Likelihood: the sum of the probabilities of all ARGs
–Challenging: too many ARGs (Lyngso, Song and Hein)
Importance sampling approach: draw samples (ARGs) from some probability distribution
–Works well with no recombination
–Does not work well with recombination
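The likelihood and its importance-sampling estimate can be written out as follows. The notation is an assumption introduced here, not taken from the slides: D is the data, θ the model parameters, G ranges over ARGs, Q is the proposal distribution, and N is the number of sampled ARGs. The identity requires Q(G) > 0 wherever P(D, G | θ) > 0.

```latex
\begin{align}
L(\theta) &= P(D \mid \theta) = \sum_{G} P(D, G \mid \theta) \\
          &= \sum_{G} \frac{P(D, G \mid \theta)}{Q(G)}\, Q(G)
           = \mathbb{E}_{Q}\!\left[ \frac{P(D, G \mid \theta)}{Q(G)} \right] \\
          &\approx \frac{1}{N} \sum_{i=1}^{N} \frac{P(D, G_i \mid \theta)}{Q(G_i)},
          \qquad G_i \sim Q .
\end{align}
```

The estimate is only as good as the proposal: if Q puts little mass on the ARGs that dominate the sum, a few samples carry almost all the weight and the variance becomes large.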

23 Coalescent-based ARG Sampling
Uniform sampling of minARGs (Wu, 2007)
–Treat each minARG as equally likely.
–Algorithm for generating a minARG uniformly at random (exponential time for setup, but polynomial time per sample)
Challenge: develop a more general ARG sampling method that efficiently samples ARGs approximately according to coalescent probabilities.
A related problem: compute the coalescent likelihood with recombination efficiently.
Recent work: exact computation of the coalescent likelihood under the infinite sites model with no recombination (Wu, 2009).
[Figure labels: probability of ARGs under certain parameters; minARG.]

24 The Mosaic Model
M: input sequences.
Assumption: the input sequences are descendants of K founder sequences (unknown).
Extant sequences: concatenations of exact copies of founder segments (no shift of position).
Coloring: assign each position of each sequence to the founder (color) it comes from; the assignment must be consistent.
[Figure: M = 0000, 0101, 0111, 1111, 1110 with K = 2, colored with 5 breakpoints in total.]
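When the founders are known, the best coloring of one sequence can be found by a simple dynamic program over positions. The sketch below assumes the founders 0000 and 1111 for the five sequences on the slide; the slide does not list its founders, so that choice is hypothetical, and `color_sequence` is an illustrative name.

```python
def color_sequence(seq, founders):
    """Fewest-breakpoint coloring of `seq` given known founders.

    Returns (breakpoints, coloring), where coloring[j] is the index of the
    founder copied at position j, or None if no consistent coloring exists.
    """
    # best[i] = (breakpoints, coloring) for the prefix read so far, ending in
    # founder i; None when founder i disagrees with seq at the current site.
    best = [(0, [i]) if f[0] == seq[0] else None for i, f in enumerate(founders)]
    for j in range(1, len(seq)):
        feasible = [entry for entry in best if entry is not None]
        if not feasible:
            return None
        cheapest = min(feasible, key=lambda entry: entry[0])
        updated = []
        for i, founder in enumerate(founders):
            if founder[j] != seq[j]:
                updated.append(None)
                continue
            # Either stay on founder i, or switch to it at the cost of a breakpoint.
            options = []
            if best[i] is not None:
                options.append((best[i][0], best[i][1] + [i]))
            options.append((cheapest[0] + 1, cheapest[1] + [i]))
            updated.append(min(options, key=lambda option: option[0]))
        best = updated
    feasible = [entry for entry in best if entry is not None]
    return min(feasible, key=lambda entry: entry[0]) if feasible else None


founders = ["0000", "1111"]  # hypothetical founders, not given on the slide
data = ["0000", "0101", "0111", "1111", "1110"]
total = sum(color_sequence(s, founders)[0] for s in data)
print(total)
```

With these assumed founders the total comes to 5 breakpoints, which agrees with the count on the slide.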

25 The Minimum Mosaic Problem
Problem: given a set of binary sequences and the number of founders K, find a K-coloring of these sequences that minimizes the number of color changes (recombination breakpoints), and find the K founder sequences (not part of the input).
[Figure: inferred founders for data from Rastas and Ukkonen: 20 sequences, 40 sites; minimum number of breakpoints: 55.]
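For very small inputs the problem can be solved by brute force, which makes the objective concrete. The sketch below tries every multiset of K binary founders and scores each with a per-sequence dynamic program. It is exponential in the number of sites and is meant only as an illustration, not as any of the published algorithms; all names are hypothetical. The input is the five-sequence example from the mosaic model slide.

```python
from itertools import combinations_with_replacement, product

INF = float("inf")


def parse_cost(seq, founders):
    """Fewest breakpoints needed to spell `seq` from the founders (INF if impossible)."""
    cost = [0 if f[0] == seq[0] else INF for f in founders]
    for j in range(1, len(seq)):
        cheapest = min(cost)
        # Stay on the same founder for free, or switch at the cost of one breakpoint.
        cost = [
            min(c, cheapest + 1) if f[j] == seq[j] else INF
            for c, f in zip(cost, founders)
        ]
    return min(cost)


def min_mosaic(data, k):
    """Exhaustive search over all founder sets of size k (tiny inputs only)."""
    n_sites = len(data[0])
    candidates = ["".join(bits) for bits in product("01", repeat=n_sites)]
    best_total, best_founders = INF, None
    for founders in combinations_with_replacement(candidates, k):
        total = sum(parse_cost(seq, founders) for seq in data)
        if total < best_total:
            best_total, best_founders = total, founders
    return best_total, best_founders


total, founders = min_mosaic(["0000", "0101", "0111", "1111", "1110"], 2)
print(total, founders)
```

The search space grows as 2^(sites × K), so anything beyond a handful of sites needs the exact or heuristic methods cited in the talk.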

26 The Minimum Mosaic Problem
Introduced by Ukkonen (2002)
Simple and easy to visualize
Main known results
–An exponential-time algorithm that runs in polynomial time for K = 2 (Ukkonen 2002)
–An exact method that works for relatively small K and modest-sized data (Wu and Gusfield, 2007)
–Haplovisual program and other extensions by Rastas and Ukkonen (2007)
–Heuristic algorithm by Roli and Blum (2009)
–Lower bounds for the minimum number of breakpoints needed (Wu, 2010)
Challenges
–Polynomial-time algorithm for K ≥ 3?
–Concrete applications in biology?

