Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution Yun S. Song, Yufeng Wu,

Presentation transcript:

Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution Yun S. Song, Yufeng Wu, Dan Gusfield UC Davis ISMB 2005

Meiotic Recombination (single-crossover)
- Recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species.
- Estimating the frequency and the location of recombination is central to modern-day genetics (e.g. disease association mapping).
[Figure: single-crossover recombination; labels: Prefix, Suffix, positions 1 to L, breakpoint b.]

SNP Sequences
Assumption: at most one mutation per site.
Possible gametic types: { 00, 01, 10, 11 }
[Figure: sequences s1 to s4 exhibiting all four possible gametic types; panels labelled Mutation and Recombination.]
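Under the one-mutation-per-site assumption, a pair of sites that shows all four gametic types cannot be explained by mutation alone, so at least one recombination must have occurred between them. This check is commonly called the four-gamete test; the slide does not name it. The sketch below is a minimal illustration, assuming sequences are given as equal-length strings over '0' and '1'. The function names are hypothetical and are not taken from HapBound or SHRUB.

```python
from itertools import combinations


def gametic_types(sequences, i, j):
    """Return the set of two-site types (e.g. '01') seen at sites i and j."""
    return {seq[i] + seq[j] for seq in sequences}


def incompatible_pairs(sequences):
    """List site pairs (i, j), i < j, that show all four types 00, 01, 10, 11."""
    n_sites = len(sequences[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if len(gametic_types(sequences, i, j)) == 4
    ]


# Hypothetical toy data: only sites 0 and 1 show all four gametic types.
example = ["000", "010", "100", "110"]
print(incompatible_pairs(example))
```

Each pair reported here forces at least one recombination somewhere between the two sites, which is the simplest kind of "local" evidence that the bounds on the following slides build on.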

Minimizing Recombinations
- Given a set M of sequences, what is the minimum number R_min(M) of recombinations needed for constructing evolutionary histories that explain M?
- Minimization is NP-hard (Wang et al. 2000, Semple 2004).
[Figure: M = Kreitman's data from the adh locus of D. melanogaster (1983), sequences 1 to 11.]

Bounds on the Minimum Number R_min(M) of Recombinations
For M, a set of sequences: L(M) ≤ R_min(M) ≤ U(M)
- Minimum R_min(M): no efficient method.
- Lower bound L(M): there are many methods.
- Upper bound U(M): novel.
Our contribution: efficient, practical algorithms for computing lower and upper bounds on R_min(M).
Key idea: if L(M) = U(M), then we know R_min(M).
Empirical observation: L(M) = U(M) frequently for a surprisingly large range of data.

The Composite Method (Myers & Griffiths 2003)
Composite Problem: Given a set of intervals and, for each interval I, a number L(I), find the minimum number of vertical lines so that every interval I intersects at least L(I) vertical lines.
- Let L(I) be a "local" recombination lower bound for I.
- The composite recombination lower bound on R_min(M) is given by a solution to the composite problem.
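The composite problem can be solved with a short longest-path dynamic program over the sites. The sketch below is one way to do it, not necessarily how RecMin or HapBound implement it. It assumes sites are numbered from 0, that an interval (i, k) spans sites i to k, and that vertical lines sit in the gaps between adjacent sites, with several lines allowed in the same gap. The name `composite_bound` and the input layout are hypothetical.

```python
def composite_bound(n_sites, local_bounds):
    """Minimum number of vertical lines meeting every interval's demand.

    local_bounds maps (i, k), with 0 <= i < k < n_sites, to the local
    bound L(I) for the interval spanning sites i..k.
    """
    # best[k] = fewest lines that must lie to the left of site k
    best = [0] * n_sites

    # Group intervals by their right end so each is visited once.
    by_right_end = {}
    for (i, k), bound in local_bounds.items():
        by_right_end.setdefault(k, []).append((i, bound))

    for k in range(1, n_sites):
        # The count of lines can never decrease as we move right.
        best[k] = best[k - 1]
        for i, bound in by_right_end.get(k, []):
            # Interval (i, k) needs `bound` lines between sites i and k.
            best[k] = max(best[k], best[i] + bound)
    return best[-1]


# Hypothetical local bounds on five sites.
bounds = {(0, 2): 1, (1, 3): 1, (2, 4): 1, (0, 4): 2}
print(composite_bound(5, bounds))
```

Placing `best[k] - best[k - 1]` lines in the gap just before site k satisfies every interval, and no solution can use fewer lines, because each chain of intervals laid end to end forces the sum of its demands.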

Optimal Haplotype Bound as L(I)
- S = a subset of columns in I.
- Haplotype bound h(S) = (number of distinct rows restricted to S) - (number of distinct columns in S) - 1.
- Optimal haplotype bound Opt(I) = maximum value of h(S) over all subsets S of columns in interval I.
Examples from the figure: h(S) = 4 - 2 - 1 = 1 and h(S) = 6 - 3 - 1 = 2.
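The definitions above translate directly into code. The sketch below assumes the data are given as equal-length strings, one per row, and that a column subset is a tuple of column indices; both function names are hypothetical. The second function tries every non-empty subset, so it takes exponential time and is only meant to make the definition of Opt(I) concrete on tiny inputs.

```python
from itertools import combinations


def haplotype_bound(rows, cols):
    """h(S) = distinct rows restricted to S - distinct columns in S - 1."""
    distinct_rows = {tuple(row[c] for c in cols) for row in rows}
    distinct_cols = {tuple(row[c] for row in rows) for c in cols}
    return len(distinct_rows) - len(distinct_cols) - 1


def optimal_haplotype_bound(rows, interval_cols):
    """Opt(I): maximum of h(S) over all non-empty subsets S of interval_cols."""
    return max(
        haplotype_bound(rows, subset)
        for size in range(1, len(interval_cols) + 1)
        for subset in combinations(interval_cols, size)
    )


# Hypothetical toy data with 6 distinct rows over 3 distinct columns.
rows = ["000", "001", "010", "100", "011", "111"]
print(haplotype_bound(rows, (0, 1)))             # 4 - 2 - 1
print(optimal_haplotype_bound(rows, (0, 1, 2)))  # 6 - 3 - 1
```

On this toy input the two calls reproduce the arithmetic of the slide's two examples, 4 - 2 - 1 = 1 and 6 - 3 - 1 = 2.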

Myers & Griffiths: For every interval I, restrict the maximum size (s) of S and the maximum distance (w) between the leftmost and the rightmost columns in S: |S| ≤ s, d ≤ w.
Implemented in RecMin, along with the composite method.
Computing the optimal haplotype bound Opt(I) is NP-hard (Bafna & Bansal 2005).
RecMin is a tremendous improvement over previous practical lower bound methods. But:
1. The user is instructed to experiment with parameters s and w until the bound does not change.
2. That does not guarantee that the bound could not be improved by further increasing the parameters.

Computing the optimal haplotype bound exactly
- We cast the problem as a classic set cover problem that can be formulated as an ILP problem, with 1 variable per column and 1 inequality per pair of rows.
- We can compute the optimal haplotype bound exactly. Can use either GNU ILP Solver or CPLEX.
Implemented in HapBound:
1. No parameters.
2. Much faster than RecMin.
3. Implements additional ideas that produce lower bounds even better than the optimal haplotype bound.
How to derive sharper bounds? In the composite method, check if each local bound L(I) is in fact equal to R_min(I), and if not, increase L(I) by one. (-S and -M options)
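One reading of the set cover formulation is this: introduce a 0/1 variable per column, and for every pair of distinct rows require that at least one chosen column tells them apart; then minimize the number of chosen columns. The sketch below builds exactly those constraints but, to stay self-contained, replaces the ILP solver with an exhaustive search over column subsets, so it only suits tiny inputs. The function name is hypothetical, and the slide does not spell out how the cover is turned into Opt(I), so treat that link as an assumption.

```python
from itertools import combinations


def min_distinguishing_columns(rows):
    """Smallest set of columns that keeps all distinct rows distinct."""
    distinct = sorted(set(rows))
    n_cols = len(distinct[0])

    # One constraint per pair of distinct rows: the columns where they differ.
    # In the ILP view, the chosen columns must hit every one of these sets.
    constraints = [
        {c for c in range(n_cols) if a[c] != b[c]}
        for a, b in combinations(distinct, 2)
    ]

    # Stand-in for the ILP solver: try subsets in order of increasing size.
    # Choosing every column always works, so the search always returns.
    for size in range(n_cols + 1):
        for chosen in combinations(range(n_cols), size):
            picked = set(chosen)
            if all(picked & differing for differing in constraints):
                return chosen


# Hypothetical toy data: columns 2 and 3 duplicate columns 0 and 1.
rows = ["0000", "0101", "1010", "1111"]
cover = min_distinguishing_columns(rows)
print(cover, len(set(rows)) - len(cover) - 1)
```

With K distinct rows and a cover S, the subset S has haplotype bound K - |S| - 1, so a smaller cover gives a larger bound for the same number of distinct rows.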

RecMin vs. HapBound on the human LPL data (Nickerson et al., 1998): 88 sequences, 48 sites

Program                        Lower bound   Time
RecMin -s 8 -w 12 (default)    59            3 sec
RecMin -s 25 -w [lost]         [lost]        [lost] sec
RecMin -s 48 -w 48             No result     5 days
HapBound                       75            31 sec
HapBound -S                    [lost]        [lost] sec

Upper Bound on R_min(M)
Branch and bound (B&B) construction of genealogies backwards in time, using an alternating series of coalescent, mutation, and recombination events.
- B&B uses recombination lower bounds and randomization.
- Implemented in SHRUB (Simulated History Recombination Upper Bound).
SHRUB constructs genealogies that can be viewed using an open source program. The genealogy contains U(M) recombination vertices.
[Figure: genealogy with Mutation and Recombination events marked.]

Kreitman's ADH data (1983)
11 alleles of the alcohol dehydrogenase locus of Drosophila melanogaster (43 sites).
- There is only one previously implemented method that computes R_min(M) exactly (Song and Hein, 2003). That method took about 1.5 GB of memory and 30 minutes of CPU time to find R_min(M) = 7.

We tried 9 different implemented lower bound methods, aside from HapBound. They all produced either 5 or 6.
Both HapBound (with -M option) and SHRUB produced 7 and took only a fraction of a second to analyze this data set.
L(M) = U(M) = 7 implies R_min(M) = 7.
[Figure: an evolutionary history, found by SHRUB, with 7 recombination events. It corresponds to the most parsimonious history.]

The Human LPL Data (Nickerson et al., 1998): 88 sequences, 48 sites
(We ignored insertions/deletions, unphased sites, and sites with missing data.)
[Table: composite optimal haplotype bounds, our new lower bound (HapBound -S -M), and the upper bound (SHRUB); the values are not preserved in the transcript.]

Match frequency for simulated data
ρ = scaled recombination rate. θ = scaled mutation rate. n = number of sequences.
Used Hudson's MS to generate 1000 simulated datasets for each pair of ρ and θ.
- For [symbol lost] < 5, our lower and upper bounds match over 90% of the time.
- This is significant progress, as there currently exists no other method that can find R_min(M) for more than 9 sequences after some data reduction.
[Figure: frequency of having L(M) = U(M).]

Software: HapBound and SHRUB can be found at wwwcsif.cs.ucdavis.edu/~gusfield/