Published by Coral Bradley. Modified over 8 years ago.
1
National Taiwan University, Department of Computer Science and Information Engineering
An Approximation Algorithm for Haplotype Inference by Maximum Parsimony
Kun-Mao Chao, Ting Chen, Yao-Ting Huang
Department of Computer Science & Information Engineering, National Taiwan University, Taiwan
Department of Biological Sciences, University of Southern California, USA
2
SNPs and Haplotypes
A Single Nucleotide Polymorphism (SNP) is a single DNA base variation observed with a frequency of more than 1% in the population. A haplotype is a set of linked SNPs on the same chromosome.
Example (SNP 1, SNP 2, and SNP 3 are the three positions at which the sequences differ):
-A C T T A G C T T-   Haplotype 1: C A T
-A A T T T G C T C-   Haplotype 2: A T C
-A C T T T G C T C-   Haplotype 3: C T C
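The relationship between the aligned sequences and their haplotypes can be sketched in a few lines of Python. This is only an illustration: the function name `snp_haplotypes` is hypothetical, and the sketch treats every position where the sequences differ as a SNP, ignoring the 1% population-frequency threshold.

```python
def snp_haplotypes(sequences):
    """Return (variable positions, haplotype strings) for aligned sequences.

    Assumption: all sequences have equal length, and any column with more
    than one distinct base counts as a SNP.
    """
    length = len(sequences[0])
    # Columns where at least two sequences disagree.
    sites = [i for i in range(length) if len({s[i] for s in sequences}) > 1]
    # Each haplotype keeps only the bases at the variable columns.
    haplotypes = ["".join(s[i] for i in sites) for s in sequences]
    return sites, haplotypes


sequences = ["ACTTAGCTT", "AATTTGCTC", "ACTTTGCTC"]
sites, haplotypes = snp_haplotypes(sequences)
print(sites)       # 0-based positions of the SNPs
print(haplotypes)  # one haplotype per input sequence
```

For the three sequences on this slide, the variable columns are the second, fifth, and ninth bases, which yields the haplotypes CAT, ATC, and CTC.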
3
Genotype Data vs. Haplotype Data
The use of haplotype information has been limited because the human genome is diploid. In large sequencing projects, genotype data are collected instead of haplotype data.
Example: haplotype data give the two haplotypes explicitly, e.g. AT and CG over SNP 1 and SNP 2. Genotype data give only the unordered alleles at each SNP: A/C at SNP 1 and G/T at SNP 2. This genotype is consistent with the haplotype pair (AT, CG) or with the pair (AG, CT). We don't know which haplotype pair is real.
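The ambiguity described above can be made concrete by enumerating every haplotype pair consistent with a genotype. The sketch below is illustrative only; the encoding (one two-character string per SNP) and the name `resolving_pairs` are assumptions, not part of the presentation.

```python
from itertools import product


def resolving_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    genotype: list of two-character strings, one per SNP, giving the
    unordered alleles at that SNP, e.g. ["AC", "GT"].
    """
    pairs = set()
    # At each SNP, either allele may sit on the first chromosome.
    for choice in product(*[(g, g[::-1]) for g in genotype]):
        h1 = "".join(c[0] for c in choice)
        h2 = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


print(resolving_pairs(["AC", "GT"]))  # two candidate pairs
print(resolving_pairs(["AA", "TT"]))  # a single homozygous pair
```

A genotype with k heterozygous SNPs has 2^(k-1) candidate pairs, which is why the number of possible haplotypes grows quickly with the number of SNPs.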
4
Haplotype Inference
Inferring the haplotypes for a set of genotypes is called haplotype inference. Many variations of this problem have been shown to be NP-hard. Most combinatorial methods solve it under the maximum parsimony model: find a minimum set of haplotypes that resolves all genotypes.
Example: the genotype with A/C at SNP 1 and G/T at SNP 2 can be resolved by (AT, CG) or by (AG, CT).
5
Maximum Parsimony
Find a minimum set of haplotypes to resolve all genotypes.
Example:
G1 (A/C at SNP 1, G/T at SNP 2) can be resolved by h1 = AT and h2 = CG, or by h3 = AG and h4 = CT.
G2 (A/A at SNP 1, T/T at SNP 2) can be resolved only by h1 = AT and h1 = AT.
Choosing (h1, h2) for G1 resolves both genotypes with two haplotypes; choosing (h3, h4) would need three.
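For tiny inputs such as the two genotypes above, the maximum parsimony objective can be checked by exhaustive search. The following is a brute-force sketch for illustration only, not the approximation algorithm of the presentation; the names `resolving_pairs` and `min_haplotype_set` and the genotype encoding are assumptions, and the search is exponential in the number of candidate haplotypes.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype."""
    pairs = set()
    for choice in product(*[(g, g[::-1]) for g in genotype]):
        h1 = "".join(c[0] for c in choice)
        h2 = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def min_haplotype_set(genotypes):
    """Smallest haplotype set that resolves every genotype (brute force)."""
    pairs = [resolving_pairs(g) for g in genotypes]
    candidates = sorted({h for ps in pairs for p in ps for h in p})
    # Try subsets in order of increasing size; the first hit is minimum.
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in ps) for ps in pairs):
                return sorted(chosen)
    return []


g1 = ["AC", "GT"]  # A/C at SNP 1, G/T at SNP 2
g2 = ["AA", "TT"]  # A/A at SNP 1, T/T at SNP 2
print(min_haplotype_set([g1, g2]))
```

On the slide's example the search returns the two haplotypes AT and CG, matching the choice of h1 and h2.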
6
Our Results
We formulated this problem as an integer quadratic programming (IQP) problem. We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem. This algorithm finds a solution within an O(log n) approximation factor. We implemented this algorithm in MATLAB and compared it with existing statistical and combinatorial methods.
7
Integer Quadratic Programming (IQP)
Given n genotypes and m possible haplotypes, let x_i = 1 if the i-th haplotype is selected and x_i = -1 if it is not. The objective is to minimize the number of selected haplotypes, subject to the constraint that each genotype must be resolved by at least one pair of selected haplotypes.
Example: G1 (A/C at SNP 1, G/T at SNP 2) with candidate haplotypes h1 = AT, h2 = CG, h3 = AG, h4 = CT.
8
Integer Quadratic Programming (IQP)
Maximum parsimony: find a minimum set of haplotypes (objective function) to resolve all genotypes (constraint functions). Solving the IQP problem is NP-hard.
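One way to write such an IQP, consistent with the definition of x_i as +1 (selected) or -1 (not selected), is sketched below. This is an assumption offered for illustration, not a reproduction of the slide's formulas; the symbol P_j is hypothetical and denotes the set of index pairs (p, q) such that haplotypes h_p and h_q together resolve genotype g_j.

```latex
\begin{align*}
\text{minimize}   \quad & \sum_{i=1}^{m} \frac{1 + x_i}{2} \\
\text{subject to} \quad & \sum_{(p,q) \in P_j} \frac{(1 + x_p)(1 + x_q)}{4} \ge 1,
                          \qquad j = 1, \dots, n, \\
                        & x_i \in \{-1, 1\}, \qquad i = 1, \dots, m.
\end{align*}
```

Here (1 + x_i)/2 equals 1 exactly when haplotype i is selected, so the objective counts the selected haplotypes, and each product term equals 1 exactly when both haplotypes of a resolving pair are selected.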
9
The Flow of the Iterative SDP Relaxation Algorithm
1. Integer Quadratic Programming (NP-hard) → relax the integer constraint → Vector Formulation.
2. Vector Formulation → reformulation → Semidefinite Programming (P).
3. Semidefinite Programming → existing SDP solver → SDP Solution.
4. SDP Solution → incomplete Cholesky decomposition → Vector Solution.
5. Vector Solution → randomized rounding → Integral Solution.
6. All genotypes resolved? Yes, done. No, repeat this algorithm.
10
Relaxation (Integer Quadratic Programming → Vector Formulation)
We relax each x_i into an (m+1)-dimensional unit vector y_i, and replace the integer constant 1 with another unit vector y_0 = (1, 0, …, 0).
11
Reformulation (Vector Formulation → Semidefinite Programming)
Let Y = (y_0 y_1 … y_m)^T (y_0 y_1 … y_m), so that the entry of Y in row i and column j is the inner product y_i · y_j.
12
Reformulation (Vector Formulation → Semidefinite Programming)
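The relaxation and reformulation steps can be illustrated as follows. This is a sketch under stated assumptions rather than the slide's own equations: it assumes the integer constraint has the product form (1 + x_p)(1 + x_q)/4 summed over a hypothetical set P_j of haplotype index pairs that resolve genotype g_j, and it replaces 1 by y_0 · y_0, x_i by y_0 · y_i, and x_p x_q by y_p · y_q.

```latex
\begin{align*}
\text{Vector formulation:} \quad
  & \min \sum_{i=1}^{m} \frac{1 + y_0 \cdot y_i}{2} \\
  & \text{s.t. } \sum_{(p,q) \in P_j}
      \frac{1 + y_0 \cdot y_p + y_0 \cdot y_q + y_p \cdot y_q}{4} \ge 1,
      \qquad j = 1, \dots, n, \\
  & \phantom{\text{s.t. }} y_i \cdot y_i = 1, \qquad i = 0, \dots, m. \\[1ex]
\text{SDP, with } Y_{ij} = y_i \cdot y_j: \quad
  & \min \sum_{i=1}^{m} \frac{1 + Y_{0i}}{2} \\
  & \text{s.t. } \sum_{(p,q) \in P_j}
      \frac{1 + Y_{0p} + Y_{0q} + Y_{pq}}{4} \ge 1,
      \qquad j = 1, \dots, n, \\
  & \phantom{\text{s.t. }} Y_{ii} = 1, \qquad Y \succeq 0.
\end{align*}
```

Every term becomes linear in the entries of Y, and requiring Y to be positive semidefinite is what guarantees that vectors y_0, …, y_m with these inner products exist.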
13
Solving SDP (Semidefinite Programming → SDP Solution)
The SDP problem can be solved in polynomial time by algorithms such as the interior point method. We obtain the matrix solution Y.
14
Decomposition (SDP Solution → Vector Solution)
Recall that Y = (y_0 y_1 … y_m)^T (y_0 y_1 … y_m). Use the incomplete Cholesky decomposition method to obtain the vector solutions y_0, y_1, …, y_m.
Vector solution: y_0 = (1, 0, …, 0). y_1 = (0.12, 0.04, …, 0.1). … y_m = (0.09, 0.1, …, 0.14).
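The decomposition step can be sketched in plain Python. Note the substitution: the slide names the incomplete Cholesky method, whereas this sketch uses an ordinary dense Cholesky factorization with a small ridge term as a stand-in. The function name `cholesky_vectors`, the `jitter` parameter, and the 3×3 example matrix are all assumptions made for illustration.

```python
import math


def cholesky_vectors(Y, jitter=1e-10):
    """Factor a Gram matrix Y so that row i of the result is the vector y_i.

    Plain Cholesky (Y ≈ L L^T), used here as a stand-in for the incomplete
    Cholesky method. The jitter keeps the square roots defined when Y is
    only positive semidefinite; a zero pivot is handled crudely.
    """
    n = len(Y)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(max(Y[i][i] + jitter - s, 0.0))
            else:
                L[i][j] = (Y[i][j] - s) / L[j][j] if L[j][j] > 0 else 0.0
    return L


# Hypothetical SDP solution with unit diagonal (illustrative values only).
Y = [[1.0, 0.2, -0.3],
     [0.2, 1.0, 0.1],
     [-0.3, 0.1, 1.0]]
vectors = cholesky_vectors(Y)
for i, y in enumerate(vectors):
    print(f"y_{i} =", [round(c, 4) for c in y])
```

Because the factor is lower triangular, the first vector comes out as approximately (1, 0, …, 0), matching the convention y_0 = (1, 0, …, 0) used in the relaxation.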
15
Randomized Rounding (Vector Solution → Integral Solution)
Randomly generate two unit vectors z_1 and z_2. Set x_i = 1 if (z_1 · y_i)(z_1 · y_0) > 0 and (z_2 · y_i)(z_2 · y_0) > 0. Otherwise, set x_i = -1. The integer solution obtained by this rounding method is close to the optimal solution.
Vector solution: y_0 = (1, 0, …, 0). y_1 = (0.12, 0.04, …, 0.1). … y_m = (0.09, 0.1, …, 0.14).
Integral solution: x_1 = 1. x_2 = -1. … x_m = 1.
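The rounding rule above translates almost directly into code. The sketch below is an illustration in plain Python; the helper names and the choice to draw each random direction from independent Gaussians and normalize it are assumptions, since the slide does not say how z_1 and z_2 are generated.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def random_unit_vector(dim, rng):
    """Draw a direction by normalizing independent Gaussian coordinates."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(c * c for c in v) ** 0.5
    return [c / norm for c in v]


def randomized_rounding(vectors, rng):
    """Round y_1..y_m to x_1..x_m in {1, -1}; vectors[0] must be y_0."""
    y0 = vectors[0]
    z1 = random_unit_vector(len(y0), rng)
    z2 = random_unit_vector(len(y0), rng)
    xs = []
    for y in vectors[1:]:
        # x_i = 1 only if y_i lies on the same side as y_0 for both z's.
        same_side_1 = dot(z1, y) * dot(z1, y0) > 0
        same_side_2 = dot(z2, y) * dot(z2, y0) > 0
        xs.append(1 if same_side_1 and same_side_2 else -1)
    return xs


# Hypothetical vector solution: y_1 equals y_0, y_2 is its negation.
vectors = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
print(randomized_rounding(vectors, random.Random(0)))
```

A vector identical to y_0 is always rounded to 1 and its negation always to -1; vectors in between are selected with a probability that falls as their angle to y_0 grows.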
16
Iterative Process
Take the integral solution (x_1 = 1, x_2 = -1, …, x_m = 1) and check whether any genotype is still unresolved. If yes, repeat this algorithm for the unresolved genotypes. If no, we are done.
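The outer loop can be sketched as follows. This is an illustration under assumptions: `propose` is a hypothetical callback standing in for one pass of SDP relaxation plus randomized rounding, haplotypes selected in earlier rounds are assumed to stay selected, and genotypes use one two-character string per SNP.

```python
def resolved(genotype, selected):
    """True if some pair of selected haplotypes explains the genotype."""
    return any(
        all(sorted(a[k] + b[k]) == sorted(genotype[k]) for k in range(len(genotype)))
        for a in selected
        for b in selected
    )


def iterate(genotypes, propose, max_rounds=100):
    """Repeat `propose` on the unresolved genotypes until none remain.

    propose(remaining) returns a set of haplotypes to add; it is a
    placeholder for the SDP relaxation and rounding pass.
    """
    selected = set()
    remaining = list(genotypes)
    for _ in range(max_rounds):
        if not remaining:
            break
        selected |= set(propose(remaining))
        remaining = [g for g in remaining if not resolved(g, selected)]
    return selected, remaining


# Scripted proposals for illustration: round 1 picks AT, round 2 picks CG.
rounds = iter([{"AT"}, {"CG"}])
selected, unresolved = iterate(
    [["AC", "GT"], ["AA", "TT"]], lambda remaining: next(rounds)
)
print(sorted(selected), unresolved)
```

In this scripted run the first round resolves only the homozygous genotype, so the second round is invoked on the remaining one.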
17
Experiments
The iterative SDP-relaxation algorithm has been implemented in MATLAB. The program has been tested on a variety of simulated and biological data:
Randomly generated haplotypes and genotypes.
Haplotypes and genotypes generated by Hudson's program.
β2-Adrenergic receptors (β2AR).
Cystic fibrosis.
18
Comparison of the Number of Haplotypes
m: number of haplotypes; k: number of SNPs; n: number of genotypes; f: failed to find a solution within two hours.
19
Experiments on Simulated Data
Define e_a as the average error rate over 100 data sets. The error rate is the proportion of genotypes whose original haplotype pairs are inferred incorrectly.
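The error-rate definition can be written down in a few lines. This is a minimal sketch assuming each haplotype pair is an unordered pair of strings; the function names and the sample data are hypothetical and are not results from the experiments.

```python
def error_rate(true_pairs, inferred_pairs):
    """Proportion of genotypes whose haplotype pair is inferred incorrectly."""
    if len(true_pairs) != len(inferred_pairs):
        raise ValueError("both lists must cover the same genotypes")
    if not true_pairs:
        return 0.0
    # Pairs are unordered, so compare them after sorting.
    wrong = sum(
        1 for t, p in zip(true_pairs, inferred_pairs) if sorted(t) != sorted(p)
    )
    return wrong / len(true_pairs)


def average_error_rate(data_sets):
    """e_a: mean error rate over (true_pairs, inferred_pairs) data sets."""
    rates = [error_rate(t, p) for t, p in data_sets]
    return sum(rates) / len(rates)


# Made-up example: one of four genotypes is phased incorrectly.
truth = [("AT", "CG"), ("AT", "AT"), ("AG", "CT"), ("AT", "CG")]
guess = [("CG", "AT"), ("AT", "AT"), ("AT", "CG"), ("AT", "CG")]
print(error_rate(truth, guess))
```

Only the third genotype differs once pair order is ignored, so this example gives an error rate of 0.25.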
20
Experiments on Biological Data
Define e_a as the average error rate over 100 data sets. The error rate is the proportion of genotypes whose original haplotype pairs are inferred incorrectly.
21
Conclusion
We proposed an iterative SDP-relaxation algorithm that finds a solution within an O(log n) approximation factor. To the best of our knowledge, this is the first paper that finds an approximation bound for this problem. The error rates of our algorithm are similar to those of HAPLOTYPER and PHASE. Our algorithm is more efficient than HAPAR.
22
Related Works
Statistical methods:
Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
Combinatorial methods:
Gusfield (2003) proposed an integer linear programming algorithm.
Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
Brown and Harrower (2004) proposed a new integer linear formulation of this problem.