Presentation is loading. Please wait.

Presentation is loading. Please wait.

BIBE 051 Haplotype Phasing using Semidefinite Programming Parag Namjoshi CSEE Department University of Maryland Baltimore County Joint work with Konstantinos.

Similar presentations


Presentation on theme: "BIBE 051 Haplotype Phasing using Semidefinite Programming Parag Namjoshi CSEE Department University of Maryland Baltimore County Joint work with Konstantinos."— Presentation transcript:

1 BIBE 051 Haplotype Phasing using Semidefinite Programming Parag Namjoshi CSEE Department University of Maryland Baltimore County Joint work with Konstantinos Kalpakis

2 BIBE 052 Outline  Biology Review  Motivation  Previous work  Our contribution  Experimental results  Conclusions

3 BIBE 053 Biology Review  living systems are composed of cells the code for the creation of the cells is packed in a molecule called DNA.  DNA consists of four nucleic acids Adenine, Cytosine, Guanine, and Thymine arranged as complementary strands of a double helix.  DNA strand = string of A,C,G, & T’s.

4 BIBE 054 Chromosomes  the genome is arranged as set of distinct chromosomes.  mammals are diploids humans have 22 + x and y chromosomes. chromosomes occur in homologous pairs one homologous chromosome is inherited from each parent homologous chromosomes contain the same genes in the same order (up to mutations)

5 BIBE 055 Single Nucleotide Polymorphisms.  Single Nucleotide Polymorphism (SNP) = mutation of a single base.  evidence suggests that in humans 90% of variation is due to SNPs DNA has long conserved regions punctuated by SNPs  there is one SNP in approximately 1000 bases most SNPS are bi-allelic  at any given locus, only two of the four possible nucleotides are present in 95% of the population  the restriction (projection) of a DNA strand to SNP sites is a haplotype

6 BIBE 056 What are Genotypes?  the genotype of diploid organisms is the conflation of the inherited haplotypes

7 BIBE 057 Genotype & Haplotype Std. Representation  genotypes and haplotypes can be represented as a 0,1,2 vectors independently for each site  identify each one of the two letters that appear in it with 0 or 1  replace each homozygous site with 0/1 using the mapping above  replace heterozygous sites with 2

8 BIBE 058 Haplotypes vs. Genotypes  large scale polymorphism studies such as Linkage Disequilibrium need haplotype information  however, experimentally it is expensive to segregate the haplotypes of the individuals it is easier to observe the genotypes of those individuals  can we find haplotypes from the genotypes computationally? a genotype with h heterozygous sites can be explained (phased) by 2 h-1 different haplotype pairs how do you choose among them?

9 BIBE 059 Haplotype Phasing with Parsimony  in Population haplotyping, given genotypes from different individuals we want to find a set of haplotypes which resolve all the genotypes Recall that there can be many such solutions Experimental evidence suggests that the number of such haplotypes is small  HPP: Haplotype Phasing Problem with Pure Parsimony Given a set of genotypes, find a minimum size set of haplotypes which conflate to produce the given genotypes  other criteria for choosing among possible sets of haplotypes are perfect phylogeny, minimum total pairwise distance, minimum diameter, etc  we focus on HPP problem Lancia, Pinotti, and Rizzi proved that the HPP is NP–complete as well as APX–hard

10 BIBE 0510 Clark’s Rule  Clark (1990) describes a greedy inference rule to find a small set of haplotypes resolving a set of genotypes Starting with a set of haplotypes H that resolves all the homozygous genotypes, do the following  for each unresolved genotype g  if there is a pair (h, h’) that resolves g with h in H, then add h’ to H, else stop  the solution obtained is sensitive to the order in which genotypes are resolved  Clark’s rule may terminate with some genotypes unresolved (orphans) The rule can be modified to include a pair of haplotypes that resolve an orphan genotype, and continue as before

11 BIBE 0511 Gusfield’s TIP  Gusfield (1999) introduces the TIP approach enumerate all distinct haplotypes that can be used to resolve any single heterozygous genotype solve an Integer linear Program (IP) to select a minimum size set haplotypes from the enumerated haplotypes that explains the genotypes TIP uses O(2 L n) variables and constraints, where L is the maximum number of heterozygous loci of any genotype Gusfield describes a number of important improvements to the basic approach above that improve performance

12 BIBE 0512 Harrower-Brown IP  Harrower and Brown give an alternate 0-1 IP for the HPP problem (HB-IP) explain the n genotypes with 2n haplotypes (not necessarily distinct) the number of distinct haplotypes used are minimized the number of variables and constraints is polynomial in n, m

13 BIBE 0513 The QIP approach - Outline  arithmetic representation of genotypes  semidefinite programming (SDP)  Quadratic Integer Program (QIP) for HPP a semidefinite programming based heuristic to solve QIP  experimental results  concluding remarks

14 BIBE 0514 Arithmetic Representation of Genotypes  represent each genotype g as a vector δ with each homozygous locus takes value 0 or 2 iff it was 0 or 1 in g each heterozygous locus takes value 1  conflation can now be replaced by addition if haplotypes h 1 and h 2 explain genotype δ, then  δ = h 1 + h 2 we call δ an arithmetic genotype g = 0 1 2 h 1 = 0 1 0 h 2 = 0 1 1 δ = 0 2 1 h 1 = 0 1 0 h 2 = 0 1 1 g  δ

15 BIBE 0515 Arithmetic Genotypes  let Δ be n x m matrix with the arithmetic genotypes as rows  let H be k x m matrix with haplotypes as rows  if haplotypes in H resolve Δ, then Δ = S H where S is a n x k 0-1-2 matrix  the row of S for a homozygous genotype has a single 2  all other rows have exactly two 1s we call S a selector matrix  i th row of S “selects” two haplotypes (rows of H) to explain i th genotype

16 BIBE 0516 The k-HPP Problem  the k-HPP problem Given nxm matrix Δ representing a set of n distinct genotypes each with m loci Find an nxk 0-1-2 selector matrix S and a kxm 0-1 haplotype matrix H such that  Δ = S H  S has as few non-zero columns as possible  all row-sums of S are 2  HPP is equivalent to k-HPP with k=2n  lower Bounds for HPP is a well known lower bound Lemma: rank(Δ) is a lower bound for HPP  Consider an optimal solution S, H  Since Δ = S H, we know that rank(Δ) = min(rank(S), rank(H)), and thus H must have at least rank(Δ) distinct rows (haplotypes)

17 BIBE 0517 Finding H given Δ and S  given Δ and H to find an S is easy  given Δ and S find an H by solving a 2-SAT problem If genotype i is resolved by haplotypes t and l, then for each locus j, add following clauses  If δ i,j = 0, add two clauses (¬h t,j ) ^ (¬h l,j )  If δ i,j = 2, add two clauses (h t,j ) ^ (h l,j )  If δ i,j = 1, add clauses (h t,j V h l,j ) ^ (¬h t,j V ¬h l,j )  Only one of the h t,j, h l,j must both be 1 2-SAT problem  has km variables and 2nm clauses  can be solved in (almost) linear time  any satisfying assignment gives a resolution of the genotypes

18 BIBE 0518 Quadratic, Vector, and Semi-definite Programs  Quadratic Integer Program Optimize a quadratic objective function subject to quadratic constraints on integer variables Strict, when each term has total degree 0 or 2  Vector program optimize a linear objective function of inner products of vector variables subject to linear constraints on inner products of those variables Strict quadratic programs lead to vector programs (products of variables are mapped to inner products of corresponding vectors)  SDP program optimize a linear objective function of the elements of a matrix X subject to  linear constraints on the elements of X  X being a positive semi-definite matrix Vector programs lead to SDP (X is the matrix of all vector inner products)  SDP programs can be solved in polynomial-time with small numerical errors, thus solving vector programs, thus solving relaxations of strict Quadratic Integer programs  construct an approximate solution to a quadratic integer program from a solution of its relaxation, obtained via SDP

19 BIBE 0519 Quadratic Integer Program for the k-HPP Subject to:

20 BIBE 0520 QIP Heuristic: SDP+Rounding+Backtracking  recursively solve k-HPP using SDP compute vectors for the variables of QIP for each selector variable S i,j, compute  P[S i,j ]=probability that a random hyperplane separates the vectors of S i,j and z variables (ala MAX-CUT) round to 1 the S i,j * with the highest P[S i,j ] residual k-HPP=k-HPP problem with the rounded S i,j ’s fixed to their rounded value if the residual k-HPP is infeasible  round S i,j * to 0 instead  if the new residual k-HPP is still infeasible  backtrack by returning infeasible recursively solve the residual k-HPP

21 BIBE 0521 Experiments  we experiment with three approaches for the HPP problem Clark’s rule LP relaxation of Gusfield’s TIP scheme with simple rounding the QIP heuristic for k–HPP with k = 2n  The MATLAB package SDPT 3.02 is used to solve the SDP relaxation of the problem  all experiments are done on a single CPU MATLAB on a Dual Xeon 2.4 Ghz desktop with 1GB memory

22 BIBE 0522 Experimental Datasets  we use synthetic datasets A and B each with 20 instances for each triplet (n, m, k) = (5, 5, 5), (8, 8, 8), (10, 10, 10), and (15, 15, 15) (and for B, recombination levels ρ = 0, 16 and 40)  generate instances of the HPP problem as follows randomly mate k haplotypes with m loci to produce n genotypes  generation of haplotypes for dataset A each locus of k haplotypes takes value 0/1 with probability ½ independent of other loci and other genotypes  generation of haplotypes for dataset B Use Hudson’s program to generate haplotypes with these parameters  diploid population of size 10 6  mutation rate = 1.5 × 10 -6  recombination levels ρ = 0, 16 and 40 corresponding to crossover probabilities 0, 4 × 10 -6, and 10 -5

23 BIBE 0523 Experimental Results

24 BIBE 0524 QIP Extensions  QIP can be extended to handle many variants of basic k-HPP problem, such as partial Genotypes  Some loci in some genotypes are unknown shared haplotypes  Prior knowledge of shared haplotypes allowing for erroneous genotypes and loci editing allowing for outlier genotypes

25 BIBE 0525 Concluding Remarks  developed arithmetic formulation for the HPP problem provides new lower bound yields simple quadratic IP (QIP) QIP can be extended to handle many variants, incorporate prior information etc  SDP relaxation of QIP that can be solved in polynomial time SDP+rounding+backtracking gives QIP heuristic  experimentally Demonstrate competitiveness of QIP heuristic vs Clark’s rule and Gusfield’s TIP relaxation Show that rank of the genotypes is a tighter lower bound than  future work Analysis of worst-case performance ratio of the QIP heuristic Devise algorithms that scale better

26 BIBE 0526 Thank You ! Questions ?


Download ppt "BIBE 051 Haplotype Phasing using Semidefinite Programming Parag Namjoshi CSEE Department University of Maryland Baltimore County Joint work with Konstantinos."

Similar presentations


Ads by Google