# Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution


Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution. Yun S. Song, Yufeng Wu, Dan Gusfield. UC Davis. ISMB 2005.

Meiotic Recombination (single-crossover)

- Recombination is one of the principal evolutionary forces responsible for shaping genetic variation within species.
- Estimating the frequency and the location of recombination is central to modern-day genetics (e.g., disease association mapping).

[Figure: a single-crossover recombinant built from the prefix of one sequence and the suffix of another; labels: Prefix, Suffix, positions 1, b, L.]

SNP Sequences

- Assumption: at most one mutation per site.
- Possible gametic types: { 00, 01, 10, 11 }.

[Figure: two example sets of four sequences s1 to s4 with their genealogies; events are labeled Mutation or Recombination, and one panel is labeled "All four possible gametic types".]
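The gametic types on this slide are the two-letter combinations observed at a pair of sites. As a minimal sketch, assuming sequences are given as equal-length 0/1 strings, the types present at each pair of sites could be collected as follows. The function names are hypothetical and not taken from the authors' software. As background not stated on the slide: under the assumption of at most one mutation per site, a pair of sites showing all four types cannot be explained by mutation alone (the classical four-gamete condition).

```python
from itertools import combinations

def gametic_types(rows, i, j):
    """Return the set of two-character types seen at sites i and j.

    rows: list of equal-length strings over '0'/'1' (an assumed input format).
    """
    return {r[i] + r[j] for r in rows}

def pairs_with_all_four_types(rows):
    """List the site pairs (i, j), i < j, at which 00, 01, 10 and 11 all occur."""
    n_sites = len(rows[0])
    return [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if len(gametic_types(rows, i, j)) == 4
    ]

if __name__ == "__main__":
    # Toy data, not from the slides.
    print(pairs_with_all_four_types(["00", "01", "10", "11"]))
```

On the toy input, the single pair of sites carries all four types, so it is the only pair reported.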

Minimizing Recombinations

- Given a set M of sequences, what is the minimum number $R_{\min}(M)$ of recombinations needed for constructing evolutionary histories that explain M?
- Minimization is NP-hard (Wang et al. 2000; Semple 2004).

[Figure: M = Kreitman's data from the adh locus of D. melanogaster (1983), sequences 1 to 11.]

Bounds on the Minimum Number $R_{\min}(M)$ of Recombinations

For M, a set of sequences: $L(M) \le R_{\min}(M) \le U(M)$.

- Minimum $R_{\min}(M)$: no efficient method.
- Lower bound L(M): there are many methods.
- Upper bound U(M): novel.

Our contribution: efficient, practical algorithms for computing lower and upper bounds on $R_{\min}(M)$.

Key idea: if L(M) = U(M), then we know $R_{\min}(M)$.

Empirical observation: L(M) = U(M) frequently, for a surprisingly large range of data.

The Composite Method (Myers & Griffiths 2003)

- Let L(I) be a "local" recombination lower bound for interval I.
- The composite recombination lower bound on $R_{\min}(M)$ is given by a solution to the composite problem.

Composite Problem: given (1) a set of intervals and (2) for each interval I, a number L(I), find the minimum number of vertical lines so that every I intersects at least L(I) vertical lines.

[Figure: intervals over the columns of M, each labeled with its local bound L(I); the values shown are 2, 1, 2, 2, 2, 3, 1, together with the number 8.]
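The composite problem stated above can be illustrated with a small greedy sketch. It makes two assumptions that the slide does not spell out: vertical lines sit in the gaps between adjacent columns, and several lines may share the same gap. Under those assumptions, processing intervals by right endpoint and putting any missing lines in the rightmost gap of the interval gives a minimum. The function name and the input format are hypothetical, and this is not presented as the RecMin or HapBound implementation.

```python
def composite_bound(intervals):
    """Solve the composite problem greedily.

    intervals: list of (left, right, need) with integer column indices,
    left < right. Gap g lies between columns g and g + 1, so the interval
    (left, right) is crossed by lines in gaps left .. right - 1.
    Assumption: several lines may be placed in the same gap.
    Returns the minimum number of lines.
    """
    lines = []  # gap index of every line placed so far
    for left, right, need in sorted(intervals, key=lambda t: t[1]):
        have = sum(1 for g in lines if left <= g < right)
        # Put any missing lines as far right as this interval allows,
        # so they also help as many later intervals as possible.
        lines.extend([right - 1] * max(0, need - have))
    return len(lines)

if __name__ == "__main__":
    # Toy intervals, not the ones drawn on the slide.
    print(composite_bound([(0, 2, 1), (1, 3, 1), (2, 4, 2)]))
```

In the toy call, one line in gap 1 serves the first two intervals and two lines in gap 3 serve the last one, for a total of 3.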

Optimal Haplotype Bound as L(I)

- S = a subset of columns in I.
- Haplotype bound: h(S) = (number of distinct rows restricted to S) − (number of distinct columns in S) − 1.
- Optimal haplotype bound: Opt(I) = maximum value of h(S) over all subsets S of columns in interval I.

[Figure: an example 0/1 matrix over interval I with two column subsets highlighted, giving h(S) = 4 − 2 − 1 = 1 and h(S) = 6 − 3 − 1 = 2.]
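To make the two definitions concrete, here is a minimal sketch that evaluates h(S) directly and finds Opt(I) by trying every non-empty subset of columns. It assumes rows are equal-length 0/1 strings. The function names are hypothetical, and the exhaustive search is exponential in the number of columns, so it only illustrates the definition and is not how RecMin or HapBound compute the bound.

```python
from itertools import combinations

def haplotype_bound(rows, cols):
    """h(S) = (#distinct rows restricted to S) - (#distinct columns in S) - 1."""
    distinct_rows = {tuple(r[c] for c in cols) for r in rows}
    distinct_cols = {tuple(r[c] for r in rows) for c in cols}
    return len(distinct_rows) - len(distinct_cols) - 1

def optimal_haplotype_bound(rows, interval_cols):
    """Opt(I): maximum of h(S) over all non-empty subsets S of the columns in I."""
    cols = list(interval_cols)
    return max(
        haplotype_bound(rows, subset)
        for k in range(1, len(cols) + 1)
        for subset in combinations(cols, k)
    )

if __name__ == "__main__":
    # Toy data: four distinct rows over two distinct columns, so h = 4 - 2 - 1.
    print(haplotype_bound(["00", "01", "10", "11"], (0, 1)))
```

The toy call has four distinct rows and two distinct columns, which gives the same arithmetic as the slide's first example, 4 − 2 − 1 = 1.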

Computing the optimal haplotype bound Opt(I) is NP-hard (Bafna & Bansal 2005).

Myers & Griffiths: for every interval I, restrict the maximum size (s) of S and the maximum distance (w) between the leftmost and the rightmost columns in S, so that |S| ≤ s and d ≤ w. This is implemented in RecMin, along with the composite method.

RecMin is a tremendous improvement over previous practical lower bound methods. But:

1. The user is instructed to experiment with parameters s and w until the bound does not change.
2. That does not guarantee that the bound could not be improved by further increasing the parameters.

We can compute the optimal haplotype bound exactly (implemented in HapBound).

- We cast the problem as a classic set cover problem that can be formulated as an ILP problem, with 1 variable per column and 1 inequality per pair of rows.
- Can use either GNU ILP Solver or CPLEX.

HapBound:

1. No parameters.
2. Much faster than RecMin.
3. Implements additional ideas that produce lower bounds even better than the optimal haplotype bound.

How to derive sharper bounds? In the composite method, check if each local bound L(I) is in fact equal to $R_{\min}(I)$, and if not, increase L(I) by one (-S and -M options).
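One way to read the set cover formulation (1 variable per column, 1 inequality per pair of rows) is sketched below. It rests on an assumption that the slide does not state: after removing duplicate rows, adding a column that separates two still-identical rows never lowers h(S), so some optimal S keeps all distinct rows distinct, and it is then enough to find the fewest columns that separate every pair of rows. The sketch replaces the ILP solver with an exhaustive search over column subsets, so it is only a stand-in for small inputs. The function name is hypothetical.

```python
from itertools import combinations

def opt_haplotype_bound_via_cover(rows):
    """Optimal haplotype bound through the set cover view.

    rows: non-empty list of equal-length 0/1 strings (assumed input format).
    Each pair of distinct rows yields one constraint: at least one chosen
    column must lie where the two rows differ. An ILP solver would minimise
    the number of chosen columns; this sketch tries subsets by size instead.
    """
    if not rows:
        raise ValueError("rows must be non-empty")
    distinct = sorted(set(rows))
    n_cols = len(distinct[0])
    # One constraint per pair of rows: the set of columns separating them.
    constraints = [
        {c for c in range(n_cols) if a[c] != b[c]}
        for a, b in combinations(distinct, 2)
    ]
    for k in range(n_cols + 1):
        for chosen in combinations(range(n_cols), k):
            picked = set(chosen)
            if all(picked & separating for separating in constraints):
                # (#distinct rows) - (minimum number of columns) - 1
                return len(distinct) - k - 1
    raise AssertionError("unreachable: all columns separate distinct rows")

if __name__ == "__main__":
    # Toy data, not from the slides.
    print(opt_haplotype_bound_via_cover(["000", "001", "010", "100", "111", "011"]))
```

In the toy call, six distinct rows cannot be separated by two binary columns, so three columns are needed and the bound is 6 − 3 − 1 = 2.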

RecMin vs. HapBound on the human LPL data (Nickerson et al., 1998): 88 sequences, 48 sites

| Program | Lower bound | Time |
|---|---|---|
| RecMin -s 8 -w 12 (default) | 59 | 3 sec |
| RecMin -s 25 -w 25 | 75 | 7944 sec |
| RecMin -s 48 -w 48 | No result | 5 days |
| HapBound | 75 | 31 sec |
| HapBound -S | 78 | 1643 sec |

Upper Bound on $R_{\min}(M)$

Branch and bound (B&B) construction of genealogies backwards in time, using an alternating series of coalescent, mutation, and recombination events.

[Figure: an example genealogy with mutation and recombination events labeled.]

- B&B uses recombination lower bounds and randomization.
- Implemented in SHRUB (Simulated History Recombination Upper Bound).
- SHRUB constructs genealogies, each containing U(M) recombination vertices, that can be viewed using an open source program.

Kreitman's ADH data (1983): 11 alleles of the alcohol dehydrogenase locus of Drosophila melanogaster (43 sites).

- Only one previously implemented method computes $R_{\min}(M)$ exactly (Song and Hein, 2003). That method took about 1.5 GB of memory and 30 minutes of CPU time to find $R_{\min}(M) = 7$.

- We tried 9 different implemented lower bound methods, aside from HapBound. They all produced either 5 or 6.
- Both HapBound (with the -M option) and SHRUB produced 7 and took only a fraction of a second to analyze this data set.
- L(M) = U(M) = 7 implies $R_{\min}(M) = 7$.

[Figure: an evolutionary history, found by SHRUB, with 7 recombination events. It corresponds to the most parsimonious history.]

The Human LPL Data (Nickerson et al., 1998): 88 sequences, 48 sites

(We ignored insertion/deletion, unphased sites, and sites with missing data.)

[Figure: bounds for the LPL data; labels: composite optimal haplotype bounds, our new lower bound (HapBound -S -M), upper bound (SHRUB).]

Match frequency for simulated data

- ρ = scaled recombination rate, θ = scaled mutation rate, n = number of sequences.
- Used Hudson's MS to generate 1000 simulated datasets for each pair of ρ and θ.
- For [symbol missing] < 5, our lower and upper bounds match over 90% of the time.
- This is significant progress, as no other method currently can find $R_{\min}(M)$ for more than 9 sequences after some data reduction.

[Figure: frequency of having L(M) = U(M).]

Software: HapBound and SHRUB can be found at wwwcsif.cs.ucdavis.edu/~gusfield/
