BIBE 05: Haplotype Phasing using Semidefinite Programming. Parag Namjoshi, CSEE Department, University of Maryland Baltimore County. Joint work with Konstantinos Kalpakis.

Presentation transcript:

Slide 1 (BIBE 05): Haplotype Phasing using Semidefinite Programming
Parag Namjoshi, CSEE Department, University of Maryland Baltimore County
Joint work with Konstantinos Kalpakis

Slide 2: Outline
- Biology review
- Motivation
- Previous work
- Our contribution
- Experimental results
- Conclusions

Slide 3: Biology Review
- Living systems are composed of cells
  - the code for the creation of the cells is packed in a molecule called DNA
- DNA is built from four nucleotide bases: adenine, cytosine, guanine, and thymine
  - arranged as complementary strands of a double helix
- DNA strand = string of A, C, G, and T's

Slide 4: Chromosomes
- The genome is arranged as a set of distinct chromosomes
- Mammals are diploid
  - humans have 22 pairs of autosomes plus the X and Y sex chromosomes
  - chromosomes occur in homologous pairs
  - one homologous chromosome is inherited from each parent
  - homologous chromosomes contain the same genes in the same order (up to mutations)

Slide 5: Single Nucleotide Polymorphisms
- Single Nucleotide Polymorphism (SNP) = mutation of a single base
- Evidence suggests that in humans 90% of variation is due to SNPs
  - DNA has long conserved regions punctuated by SNPs
- There is one SNP in approximately 1000 bases
  - most SNPs are bi-allelic
  - at any given locus, only two of the four possible nucleotides are present in 95% of the population
- The restriction (projection) of a DNA strand to SNP sites is a haplotype

Slide 6: What are Genotypes?
- The genotype of a diploid organism is the conflation of the inherited haplotypes

Slide 7: Genotype & Haplotype Standard Representation
- Genotypes and haplotypes can be represented as vectors over {0, 1, 2}
- Independently for each site:
  - identify each of the two letters (alleles) that appear at the site with 0 or 1
  - replace each homozygous site with 0 or 1 using the mapping above
  - replace each heterozygous site with 2
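The standard representation can be illustrated with a minimal Python sketch. The function name `conflate` and the sample haplotypes are assumptions made for illustration; they do not come from the slides.

```python
def conflate(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 marks a heterozygous site)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Equal alleles keep their 0/1 code; differing alleles become 2.
    return [a if a == b else 2 for a, b in zip(h1, h2)]


h1 = [0, 1, 1, 0]
h2 = [0, 0, 1, 1]
genotype = conflate(h1, h2)
print(genotype)  # [0, 2, 1, 2]
```

Sites 2 and 4 differ between the two haplotypes, so they are the only ones coded as 2.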

Slide 8: Haplotypes vs. Genotypes
- Large-scale polymorphism studies, such as linkage disequilibrium studies, need haplotype information
- However, it is experimentally expensive to segregate the haplotypes of individuals
  - it is easier to observe the genotypes of those individuals
- Can we find haplotypes from the genotypes computationally?
  - a genotype with $h$ heterozygous sites can be explained (phased) by $2^{h-1}$ different haplotype pairs
  - how do you choose among them?
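The count of candidate phasings can be checked by enumeration. The sketch below is illustrative only; the helper name `phasings` and the sample genotype are assumptions. It fixes the first heterozygous site of one haplotype to 0 so that each unordered pair is listed once, which is where the exponent h-1 comes from.

```python
from itertools import product


def phasings(genotype):
    """Yield every unordered haplotype pair that conflates to a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        yield h, h
        return
    # Pin the first heterozygous site of h1 to 0; the other h-1 sites are free.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield tuple(h1), tuple(h2)


pairs = list(phasings([0, 2, 1, 2, 2]))
print(len(pairs))  # 4, i.e. 2**(3-1) for three heterozygous sites
```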

Slide 9: Haplotype Phasing with Parsimony
- In population haplotyping, given genotypes from different individuals, we want to find a set of haplotypes that resolves all the genotypes
  - recall that there can be many such solutions
  - experimental evidence suggests that the number of such haplotypes is small
- HPP: Haplotype Phasing Problem with Pure Parsimony
  - given a set of genotypes, find a minimum-size set of haplotypes that conflate to produce the given genotypes
- Other criteria for choosing among possible sets of haplotypes are perfect phylogeny, minimum total pairwise distance, minimum diameter, etc.
- We focus on the HPP problem
  - Lancia, Pinotti, and Rizzi proved that HPP is NP-complete as well as APX-hard

Slide 10: Clark's Rule
- Clark (1990) describes a greedy inference rule to find a small set of haplotypes resolving a set of genotypes
  - starting with a set of haplotypes H that resolves all the homozygous genotypes, do the following:
    - for each unresolved genotype g
    - if there is a pair (h, h') that resolves g with h in H, then add h' to H, else stop
- The solution obtained is sensitive to the order in which genotypes are resolved
- Clark's rule may terminate with some genotypes unresolved (orphans)
  - the rule can be modified to include a pair of haplotypes that resolve an orphan genotype, and continue as before
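A minimal Python sketch of the greedy rule as described on this slide is given below. It is an illustration under stated assumptions rather than Clark's original procedure: the function names, the tuple encoding, and the choice to sweep the genotypes repeatedly until no further one can be resolved are all assumptions, and the orphan-handling modification is not implemented.

```python
def complement(genotype, h):
    """Return h' such that (h, h') resolves the 0/1/2 genotype, or None if h does not fit."""
    mate = []
    for g, a in zip(genotype, h):
        if g == 2:
            mate.append(1 - a)      # heterozygous site: the mate carries the other allele
        elif g == a:
            mate.append(a)          # homozygous site: both haplotypes agree
        else:
            return None             # h contradicts a homozygous site
    return tuple(mate)


def clark(genotypes):
    """Greedy inference; returns (haplotypes, resolutions, orphans)."""
    genotypes = [tuple(g) for g in genotypes]
    known, resolved = [], {}
    for g in genotypes:             # seed H with the homozygous genotypes
        if 2 not in g:
            resolved[g] = (g, g)
            if g not in known:
                known.append(g)
    progress = True
    while progress:                 # sweep until no new genotype can be resolved
        progress = False
        for g in genotypes:
            if g in resolved:
                continue
            for h in known:
                mate = complement(g, h)
                if mate is not None:
                    resolved[g] = (h, mate)
                    if mate not in known:
                        known.append(mate)
                    progress = True
                    break
    orphans = [g for g in genotypes if g not in resolved]
    return known, resolved, orphans


haps, resolved, orphans = clark([(0, 1, 0), (2, 1, 2), (2, 2, 0), (1, 0, 2)])
print(haps)     # [(0, 1, 0), (1, 1, 1), (1, 0, 0), (1, 0, 1)]
print(orphans)  # []
```

Reordering the input list can change which haplotypes are added, which is the order sensitivity the slide mentions.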

Slide 11: Gusfield's TIP
- Gusfield (1999) introduces the TIP approach
  - enumerate all distinct haplotypes that can be used to resolve any single heterozygous genotype
  - solve an integer linear program (IP) to select a minimum-size set of haplotypes from the enumerated haplotypes that explains the genotypes
  - TIP uses $O(2^L n)$ variables and constraints, where $L$ is the maximum number of heterozygous loci of any genotype
  - Gusfield describes a number of important improvements to the basic approach above that improve performance

Slide 12: Harrower-Brown IP
- Harrower and Brown give an alternate 0-1 IP for the HPP problem (HB-IP)
  - explain the n genotypes with 2n haplotypes (not necessarily distinct)
  - the number of distinct haplotypes used is minimized
  - the number of variables and constraints is polynomial in n and m

Slide 13: The QIP Approach - Outline
- Arithmetic representation of genotypes
- Semidefinite programming (SDP)
- Quadratic Integer Program (QIP) for HPP
  - a semidefinite programming based heuristic to solve QIP
- Experimental results
- Concluding remarks

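Slide 14: Arithmetic Representation of Genotypes
- Represent each genotype g as a vector $\delta$ in which
  - each homozygous locus takes value 0 or 2 according to whether it was 0 or 1 in g
  - each heterozygous locus takes value 1
- Conflation can now be replaced by addition
  - if haplotypes $h_1$ and $h_2$ explain genotype $\delta$, then $\delta = h_1 + h_2$
  - we call $\delta$ an arithmetic genotype
[Figure: worked example showing a genotype g, its haplotypes $h_1$ and $h_2$, and the corresponding arithmetic genotype $\delta$; the values are not preserved in the transcript]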

Slide 15: Arithmetic Genotypes
- Let $\Delta$ be the $n \times m$ matrix with the arithmetic genotypes as rows
- Let $H$ be a $k \times m$ matrix with haplotypes as rows
- If the haplotypes in $H$ resolve $\Delta$, then $\Delta = SH$, where $S$ is an $n \times k$ matrix
  - the row of $S$ for a homozygous genotype has a single 2
  - all other rows have exactly two 1s
- We call $S$ a selector matrix
  - the $i$-th row of $S$ "selects" two haplotypes (rows of $H$) to explain the $i$-th genotype
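A small worked instance of the selector-matrix identity may help. The sketch below uses plain Python lists; the helper names `to_arithmetic` and `matmul` and the three sample haplotypes are assumptions chosen for illustration.

```python
def to_arithmetic(genotype):
    """Map the 0/1/2 coding to arithmetic coding: 0 -> 0, 1 -> 2, heterozygous 2 -> 1."""
    return [{0: 0, 1: 2, 2: 1}[g] for g in genotype]


def matmul(S, H):
    """Plain-list matrix product S * H."""
    return [[sum(s * h for s, h in zip(row, col)) for col in zip(*H)] for row in S]


H = [[0, 1, 0],      # haplotype 0
     [1, 1, 1],      # haplotype 1
     [1, 0, 0]]      # haplotype 2
S = [[2, 0, 0],      # homozygous genotype: haplotype 0 used twice
     [1, 1, 0],      # haplotypes 0 and 1
     [1, 0, 1]]      # haplotypes 0 and 2
Delta = matmul(S, H)
print(Delta)  # [[0, 2, 0], [1, 2, 1], [1, 1, 0]]
```

Each row of `Delta` is the sum of the two selected haplotypes, and every row of `S` sums to 2.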

Slide 16: The k-HPP Problem
- The k-HPP problem
  - given an $n \times m$ matrix $\Delta$ representing a set of n distinct genotypes, each with m loci
  - find an $n \times k$ selector matrix $S$ and a $k \times m$ 0-1 haplotype matrix $H$ such that
    - $\Delta = SH$
    - $S$ has as few non-zero columns as possible
    - all row sums of $S$ are 2
- HPP is equivalent to k-HPP with $k = 2n$
- Lower bounds for HPP
  - [expression missing from the transcript] is a well-known lower bound
  - Lemma: $\mathrm{rank}(\Delta)$ is a lower bound for HPP
    - consider an optimal solution $S$, $H$
    - since $\Delta = SH$, we know that $\mathrm{rank}(\Delta) \le \min(\mathrm{rank}(S), \mathrm{rank}(H))$, and thus $H$ must have at least $\mathrm{rank}(\Delta)$ distinct rows (haplotypes)
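The rank lower bound is easy to evaluate on a small instance. The sketch below computes the rank over the rationals with exact Gaussian elimination; the function name `rank` and the sample matrix are assumptions for illustration.

```python
from fractions import Fraction


def rank(matrix):
    """Rank over the rationals via exact Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0                                    # number of pivots found so far
    for c in range(ncols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            if rows[i][c] != 0:
                f = rows[i][c] / rows[r][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


Delta = [[0, 2, 0],
         [1, 2, 1],
         [1, 1, 0]]
print(rank(Delta))  # 3
```

For this matrix the bound is 3, so no set of fewer than three haplotypes can explain the three genotypes.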

Slide 17: Finding H given Δ and S
- Given $\Delta$ and $H$, finding an $S$ is easy
- Given $\Delta$ and $S$, find an $H$ by solving a 2-SAT problem
  - if genotype $i$ is resolved by haplotypes $t$ and $l$, then for each locus $j$ add the following clauses:
    - if $\delta_{i,j} = 0$, add two clauses $(\lnot h_{t,j}) \land (\lnot h_{l,j})$
    - if $\delta_{i,j} = 2$, add two clauses $(h_{t,j}) \land (h_{l,j})$
    - if $\delta_{i,j} = 1$, add clauses $(h_{t,j} \lor h_{l,j}) \land (\lnot h_{t,j} \lor \lnot h_{l,j})$
      - exactly one of $h_{t,j}$, $h_{l,j}$ must be 1
- The 2-SAT problem
  - has $km$ variables and $2nm$ clauses
  - can be solved in (almost) linear time
  - any satisfying assignment gives a resolution of the genotypes
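The clause construction above can be sketched in Python with a generic 2-SAT solver. The solver below uses the standard implication-graph and strongly-connected-components method (Kosaraju's algorithm); the slides do not say which 2-SAT algorithm the authors used, so that choice, the function names, and the literal encoding are all assumptions. Unit clauses such as (h) are written as (h OR h).

```python
def solve_2sat(num_vars, clauses):
    """Return a list of bools satisfying all clauses, or None. Literal 2*v is x_v, 2*v+1 is NOT x_v."""
    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:                       # (a OR b) gives NOT a -> b and NOT b -> a
        for u, w in ((a ^ 1, b), (b ^ 1, a)):
            graph[u].append(w)
            rgraph[w].append(u)
    order, seen = [], [False] * n              # pass 1: iterative DFS finishing order
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(graph[u]):
                stack.append((u, i + 1))
                w = graph[u][i]
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, 0))
            else:
                order.append(u)
    comp, label = [-1] * n, 0                  # pass 2: components on the reversed graph
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = label
        stack = [s]
        while stack:
            u = stack.pop()
            for w in rgraph[u]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1
    if any(comp[2 * v] == comp[2 * v + 1] for v in range(num_vars)):
        return None                            # x and NOT x in one component: unsatisfiable
    # Labels follow topological order; a variable is true if x comes after NOT x.
    return [comp[2 * v] > comp[2 * v + 1] for v in range(num_vars)]


def haplotypes_from_selection(Delta, pairs, k):
    """pairs[i] = (t, l) are the haplotype rows chosen for genotype i; returns H or None."""
    m = len(Delta[0])
    clauses = []
    for i, (t, l) in enumerate(pairs):
        for j in range(m):
            a, b = 2 * (t * m + j), 2 * (l * m + j)    # positive literals of h[t][j], h[l][j]
            if Delta[i][j] == 0:
                clauses += [(a ^ 1, a ^ 1), (b ^ 1, b ^ 1)]
            elif Delta[i][j] == 2:
                clauses += [(a, a), (b, b)]
            else:
                clauses += [(a, b), (a ^ 1, b ^ 1)]
    values = solve_2sat(k * m, clauses)
    if values is None:
        return None
    return [[int(values[t * m + j]) for j in range(m)] for t in range(k)]


Delta = [[0, 2, 0], [1, 2, 1], [1, 1, 0]]
H = haplotypes_from_selection(Delta, [(0, 0), (0, 1), (0, 2)], k=3)
print(H)  # [[0, 1, 0], [1, 1, 1], [1, 0, 0]]
```

In this instance the first genotype is homozygous and pins haplotype 0, which in turn forces the other two haplotypes, so the satisfying assignment is unique.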

Slide 18: Quadratic, Vector, and Semidefinite Programs
- Quadratic integer program
  - optimize a quadratic objective function subject to quadratic constraints on integer variables
  - strict, when each term has total degree 0 or 2
- Vector program
  - optimize a linear objective function of inner products of vector variables, subject to linear constraints on inner products of those variables
  - strict quadratic programs lead to vector programs (products of variables are mapped to inner products of the corresponding vectors)
- SDP program
  - optimize a linear objective function of the elements of a matrix $X$, subject to
    - linear constraints on the elements of $X$
    - $X$ being a positive semidefinite matrix
  - vector programs lead to SDPs ($X$ is the matrix of all vector inner products)
- SDP programs can be solved in polynomial time with small numerical errors, thus solving vector programs, thus solving relaxations of strict quadratic integer programs
- Construct an approximate solution to a quadratic integer program from a solution of its relaxation, obtained via SDP

Slide 19: Quadratic Integer Program for the k-HPP
[Equations: the objective and the "subject to" constraints of the QIP; not preserved in the transcript]

Slide 20: QIP Heuristic: SDP + Rounding + Backtracking
- Recursively solve k-HPP using SDP
  - compute vectors for the variables of QIP
  - for each selector variable $S_{i,j}$, compute
    - $P[S_{i,j}]$ = probability that a random hyperplane separates the vectors of the $S_{i,j}$ and $z$ variables (à la MAX-CUT)
  - round to 1 the $S^*_{i,j}$ with the highest $P[S_{i,j}]$
  - residual k-HPP = the k-HPP problem with the rounded $S_{i,j}$'s fixed to their rounded values
  - if the residual k-HPP is infeasible
    - round $S^*_{i,j}$ to 0 instead
    - if the new residual k-HPP is still infeasible, backtrack by returning infeasible
  - recursively solve the residual k-HPP
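The hyperplane-separation probability has a closed form familiar from MAX-CUT rounding: a uniformly random hyperplane through the origin separates two vectors with probability equal to the angle between them divided by π. The sketch below evaluates that formula; the function name and the toy vectors are assumptions, and real inputs would come from the SDP solution, which is not reproduced here.

```python
import math


def separation_probability(u, v):
    """Probability that a random hyperplane through the origin separates u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))   # clamp rounding noise
    return math.acos(cos_angle) / math.pi


z = (1.0, 0.0)                                  # hypothetical vector of the z variable
candidates = {"S_1_1": (0.0, 1.0),              # orthogonal to z
              "S_1_2": (-1.0, 0.0),             # opposite to z
              "S_1_3": (1.0, 0.0)}              # identical to z
probs = {name: separation_probability(vec, z) for name, vec in candidates.items()}
best = max(probs, key=probs.get)
print(best)  # S_1_2
```

The variable whose vector is most separated from z is the one the heuristic would round to 1 first.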

Slide 21: Experiments
- We experiment with three approaches for the HPP problem
  - Clark's rule
  - LP relaxation of Gusfield's TIP scheme with simple rounding
  - the QIP heuristic for k-HPP with $k = 2n$
- The MATLAB package SDPT3 (version 3.02) is used to solve the SDP relaxation of the problem
- All experiments are run in MATLAB on a single CPU of a dual Xeon 2.4 GHz desktop with 1 GB of memory

Slide 22: Experimental Datasets
- We use synthetic datasets A and B, each with 20 instances for each triplet (n, m, k) = (5, 5, 5), (8, 8, 8), (10, 10, 10), and (15, 15, 15) (and, for B, recombination levels ρ = 0, 16, and 40)
- Generate instances of the HPP problem as follows
  - randomly mate k haplotypes with m loci to produce n genotypes
- Generation of haplotypes for dataset A
  - each locus of the k haplotypes takes value 0 or 1 with probability ½, independent of other loci and other genotypes
- Generation of haplotypes for dataset B
  - use Hudson's program to generate haplotypes with these parameters:
    - diploid population of size $10^6$
    - mutation rate = 1.5 × [exponent missing from the transcript]
    - recombination levels ρ = 0, 16, and 40, corresponding to crossover probabilities 0, $4 \times 10^{-6}$, and $10^{-5}$
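The dataset A recipe can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' generator: the function name, the seeding, and the choice to draw both parents uniformly with replacement are assumptions, and the sketch does not enforce that the n genotypes are distinct.

```python
import random


def generate_instance(n, m, k, seed=None):
    """Draw k random 0/1 haplotypes with m loci and mate random pairs into n genotypes."""
    rng = random.Random(seed)
    # Each locus is 0 or 1 with probability 1/2, independently of everything else.
    haplotypes = [[rng.randint(0, 1) for _ in range(m)] for _ in range(k)]
    genotypes = []
    for _ in range(n):
        h1, h2 = rng.choice(haplotypes), rng.choice(haplotypes)
        # Standard 0/1/2 coding: 2 marks a heterozygous locus.
        genotypes.append([a if a == b else 2 for a, b in zip(h1, h2)])
    return haplotypes, genotypes


haplotypes, genotypes = generate_instance(n=8, m=8, k=8, seed=1)
print(len(haplotypes), len(genotypes))  # 8 8
```

Because every genotype is built from the k generated haplotypes, k is an upper bound on the optimal HPP value for the instance.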

Slide 23: Experimental Results

Slide 24: QIP Extensions
- QIP can be extended to handle many variants of the basic k-HPP problem, such as
  - partial genotypes
    - some loci in some genotypes are unknown
  - shared haplotypes
    - prior knowledge of shared haplotypes
  - allowing for erroneous genotypes and loci editing
  - allowing for outlier genotypes

Slide 25: Concluding Remarks
- Developed an arithmetic formulation for the HPP problem
  - provides a new lower bound
  - yields a simple quadratic IP (QIP)
  - QIP can be extended to handle many variants, incorporate prior information, etc.
- SDP relaxation of QIP that can be solved in polynomial time
  - SDP + rounding + backtracking gives the QIP heuristic
- Experimentally
  - demonstrate competitiveness of the QIP heuristic vs. Clark's rule and Gusfield's TIP relaxation
  - show that the rank of the genotypes is a tighter lower bound than [expression missing from the transcript]
- Future work
  - analysis of the worst-case performance ratio of the QIP heuristic
  - devise algorithms that scale better

Slide 26: Thank You! Questions?