Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs Tamar Barzuza 1 Jacques S. Beckmann 2,3 Ron Shamir 4 Itsik Pe’er 5.


Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs
Tamar Barzuza (1), Jacques S. Beckmann (2,3), Ron Shamir (4), Itsik Pe’er (5)
1 Computer Science and Applied Mathematics, Weizmann Institute of Science
2 Molecular Genetics, Weizmann Institute of Science
3 Génétique Médicale, Universitätsspital Lausanne
4 School of Computer Science, Tel-Aviv University
5 Medical and Population Genetics Group, Broad Institute

Overview
- Introduction
- Xor PPH
  - Theoretical outlines and results
  - Experimental results
- Informative SNPs
  - Theoretical results
- Summary and Future research

Chromosomes

AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA
AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGATTAGCTGCCACA
AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA

SNP: single nucleotide polymorphism

Alleles at the polymorphic sites (one row per chromosome, one column per SNP):

SNP:  1  2  3  4
      A  G  A  C
      T  T  A  C
      T  T  C  C
      A  T  C  T
      A  G  C  T
      A  G  C  T

Haplotypes, Genotypes and XOR-Genotypes

SNP:           1    2    3    4
Haplotypes:    A    G    A    C
               T    T    A    C
Genotype:      A/T  T/G  A    C
XOR-Genotype:  Het  Het  Hom  Hom

Haplotypes, Genotypes and XOR-Genotypes
The XOR-genotype is written as the set of heterozygous SNPs. For the haplotype pair AGAC / TTAC on SNPs 1, 2, 3, 4:
XOR-Genotype: {1, 2}
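As a minimal sketch of the two definitions on this slide, the genotype and the XOR-genotype can both be derived from a haplotype pair. The function names `genotype` and `xor_genotype` are hypothetical, chosen only for this illustration; SNPs are numbered from 1 as on the slide, and the two alleles of a heterozygous site are printed in alphabetical order because the pair is unordered.

```python
def genotype(h1, h2):
    """Per-SNP unordered allele pair; a single letter when homozygous."""
    return [a if a == b else "/".join(sorted((a, b))) for a, b in zip(h1, h2)]


def xor_genotype(h1, h2):
    """Set of 1-based SNP indices at which the two haplotypes differ (the het SNPs)."""
    return {i for i, (a, b) in enumerate(zip(h1, h2), start=1) if a != b}


# Haplotype pair taken from the slide.
print(genotype("AGAC", "TTAC"))      # ['A/T', 'G/T', 'A', 'C']
print(xor_genotype("AGAC", "TTAC"))  # {1, 2}
```

The XOR-genotype keeps only which sites are heterozygous, so it carries strictly less information than the genotype.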

Perfect Phylogeny
SNPs only
[Figure: perfect phylogeny tree; edge labels include 1: 1 → 0, 5: 0 → 1, 2: 0 → 1, 3: 0 → …]

Previous work
Haplotyping: haplotypes from genotypes
Input: Genotypes G = {G_1, …, G_n} on SNPs S = {s_1, …, s_m}
Output: The haplotypes H = {H_1, …, H_2n} that gave rise to G
General heuristics:
- Clark ’90
- Excoffier + Slatkin ’95
PPH: Perfect phylogeny haplotyping (n genotypes, m SNPs):
- Gusfield 2002: O(nm α(n, m)), via graph realization
- Bafna et al. 2002: O(nm^2)
- Eskin et al. 2003: O(nm^2)

Previous work
The graph realization problem:
Input: A hypergraph H = ({1, …, m}, P), where P = {P_1, P_2, …, P_n} and P_i ⊆ {1, …, m}
Goal: A tree T = (V, E) with E = N, N = {1, …, m}, such that every P_i labels a path in T
Example input: { {1, 2}, {2, 3} }
Output: [figure: tree]
Algorithms:
- Tutte 1959: O(n^2 m)
- Gavril and Tamari 1983: O(nm^2)
- Bixby and Wagner 1988: O(nm α(n, m))
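To make the goal concrete, here is a small sketch that only verifies a candidate answer: given a tree whose edges carry the labels 1, …, m, it checks that every P_i labels a path. It does not construct a realization and is not any of the cited algorithms. The names `labels_a_path` and `is_realization` and the vertex names are hypothetical, and the code assumes the caller really passes a tree.

```python
from collections import Counter


def labels_a_path(tree_edges, labels):
    """tree_edges maps an edge label to its endpoints (u, v).

    Assumes tree_edges describes a tree. The chosen edges then form a forest,
    and a forest with maximum degree 2 and exactly two degree-1 vertices is a
    single simple path.
    """
    if not labels:
        return False
    degree = Counter()
    for label in labels:
        u, v = tree_edges[label]
        degree[u] += 1
        degree[v] += 1
    ends = sum(1 for d in degree.values() if d == 1)
    return ends == 2 and max(degree.values()) <= 2


def is_realization(tree_edges, hyperedges):
    """True if every hyperedge labels a path in the given edge-labeled tree."""
    return all(labels_a_path(tree_edges, p) for p in hyperedges)


# Two candidate trees for the slide's input { {1, 2}, {2, 3} }.
path_tree = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d")}
star_tree = {1: ("c", "x"), 2: ("c", "y"), 3: ("c", "z")}
print(is_realization(path_tree, [{1, 2}, {2, 3}]))  # True
print(is_realization(star_tree, [{1, 2}, {2, 3}]))  # True
print(is_realization(path_tree, [{1, 3}]))          # False
```

Both trees realize the example input, which illustrates why counting the number of realizations matters later in the talk.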


XPPH - Xor perfect phylogeny haplotyping
Xor-haplotyping: haplotypes from xor-genotypes
Input:
1. Xor-genotype data (can be obtained by DHPLC)
2. Three genotypes
Goal: Resolve the haplotypes and their perfect phylogeny

Xor-genotypes   Genotypes         Haplotypes
{1, 2}          0/1 0/1 0 1       ?
{2, 4}          0 0/1             ?
{2, 3, 4}       0 0/1 0/1 0/1     ?
{1, 2, 4}       0/1 0/1 0 0/1     ?
{1}             0/                ?

XPPH - Xor perfect phylogeny haplotyping
Strategy:
1. Input: Xor-genotype data. Goal: Find the perfect phylogeny.
2. Additional input: 3 genotypes. Goal: Find the haplotypes.
Step 1:
Xor-genotype = {Het SNPs} = a path in the perfect phylogeny
⇒ Build a tree from its paths
⇒ Graph realization
Input reduction: Merge SNPs that are equivalent in the xor-data
Proof: Unique graph realization solution ⇒ a perfect phylogeny
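The input reduction can be sketched as follows, reading "equivalent in the xor-data" as "occurring in exactly the same xor-genotypes". That reading, the function name `merge_equivalent_snps`, and the sample data are assumptions made for illustration only.

```python
from collections import defaultdict


def merge_equivalent_snps(xor_genotypes):
    """Group SNPs that occur in exactly the same xor-genotypes.

    SNPs that are heterozygous in no individual never appear in the input
    and are therefore not reported.
    """
    occurrences = defaultdict(set)
    for index, xor_genotype in enumerate(xor_genotypes):
        for snp in xor_genotype:
            occurrences[snp].add(index)
    classes = defaultdict(list)
    for snp, where in occurrences.items():
        classes[frozenset(where)].append(snp)
    return sorted(sorted(group) for group in classes.values())


# Hypothetical data: SNPs 2 and 5 always occur together, so they are merged.
data = [{1, 2, 5}, {2, 4, 5}, {2, 3, 4, 5}]
print(merge_equivalent_snps(data))  # [[1], [2, 5], [3], [4]]
```

Each group can then be treated as a single edge label when the tree is built, which shrinks the graph realization instance.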

GREAL
We implemented Gavril & Tamari’s algorithm (1983) for graph realization: O(m^2 n)
- Finds a graph realization or determines that none exists
- Counts the number of graph realization solutions for the data
- Stable and fast
- Available at [URL not preserved]
Simulations
- Simulate data of n individuals using Hudson 2002
- Remove all SNPs with <5% minor allele frequency
- Apply GREAL: is there a single solution?
- Repeat 5000 times for each n
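The frequency filter in the simulation pipeline can be sketched like this. It assumes haplotypes are equal-length lists of 0/1 alleles, which is an assumption about the data layout rather than something the slides specify, and `filter_by_maf` is a hypothetical name. The simulation step and GREAL itself are not reproduced here.

```python
def filter_by_maf(haplotypes, threshold=0.05):
    """Drop SNP columns whose minor allele frequency is below the threshold.

    Returns the kept 0-based column indices and the filtered haplotypes.
    """
    n = len(haplotypes)
    m = len(haplotypes[0]) if n else 0
    keep = []
    for j in range(m):
        freq = sum(h[j] for h in haplotypes) / n
        if min(freq, 1 - freq) >= threshold:
            keep.append(j)
    return keep, [[h[j] for j in keep] for h in haplotypes]


# Hypothetical sample of 20 haplotypes on 4 SNPs:
# column 0 has frequency 0.5, column 1 is monomorphic 0,
# column 2 has frequency 0.1, column 3 is monomorphic 1.
sample = [[i % 2, 0, 1 if i < 2 else 0, 1] for i in range(20)]
kept, filtered = filter_by_maf(sample)
print(kept)  # [0, 2]
```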

Results
[Figure: the percentage of single solutions vs. sample size]

Results
[Figure: R.H. Chung and D. Gusfield 2003]

XPPH Step 2
[Figure: xor-genotypes {1, 2}, {1, 3}, {2, 3}; the perfect phylogeny; the haplotypes marked "?"]
Resolution up to bit flipping: gives the structure of the haplotypes

XPPH Step 2
[Figure: xor-genotypes {1, 2}, {1, 3}, {2, 3} on SNPs 1, 2, 3; a genotype in which SNP #1 is homozygous (shown as 1 x x, 0 x x); the perfect phylogeny; the haplotypes marked "?"]
SNP #1 homozygous ⇒ can infer SNP #1 for all haplotypes
⇒ Need individuals with ∩ xor-genotypes (= ∩ {het SNPs}) = ∅

Theorem: ∩ xor-genotypes = ∅ ⇒ there are three xor-genotypes with empty intersection
Proof:
Note: xor-genotypes are tree paths (otherwise: NP-hard)
(1) The intersection of two tree paths is an interval
(2) Pick X_1 arbitrarily and take X_1 ∩ X_2, X_1 ∩ X_3, …, X_1 ∩ X_n; each is an interval of X_1
(3) Let X_L be the xor-genotype whose interval ends first and X_R the one whose interval begins last. Since the intersection of all the intervals is empty, the interval of X_L ends before the interval of X_R begins
⇒ X_1 ∩ X_L ∩ X_R = ∅
[Figure: the path X_1 with the intervals of X_L and X_R marked on it]

- Find 3 individuals to genotype in O(nm)
- Resolve the haplotypes
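The selection of the three individuals follows the proof directly, and can be sketched as below. The sketch assumes the order of the SNPs along the tree path of X_1 is already known from the realized phylogeny and is passed in as `first_path_order`; that parameter, like the function name `three_with_empty_intersection`, is hypothetical. If some X_i is already disjoint from X_1, the sketch returns that pair with the index repeated.

```python
def three_with_empty_intersection(xor_genotypes, first_path_order):
    """Return indices (0, L, R) with X_1 ∩ X_L ∩ X_R = ∅, or None.

    xor_genotypes: list of sets, each assumed to be a path in the phylogeny.
    first_path_order: the SNPs of xor_genotypes[0] in their order along its path.
    """
    position = {snp: k for k, snp in enumerate(first_path_order)}
    x1 = xor_genotypes[0]
    ends_first = None    # (end position, index) of the interval that ends first
    begins_last = None   # (start position, index) of the interval that begins last
    for i in range(1, len(xor_genotypes)):
        hits = [position[s] for s in xor_genotypes[i] & x1]
        if not hits:
            return (0, i, i)  # X_1 and X_i are already disjoint
        start, end = min(hits), max(hits)
        if ends_first is None or end < ends_first[0]:
            ends_first = (end, i)
        if begins_last is None or start > begins_last[0]:
            begins_last = (start, i)
    if ends_first is None or begins_last[0] <= ends_first[0]:
        return None  # all the intervals share a common SNP
    return (0, ends_first[1], begins_last[1])


# The slides' example: {1, 2}, {1, 3}, {2, 3}, with X_1 = {1, 2} ordered as 1, 2.
print(three_with_empty_intersection([{1, 2}, {1, 3}, {2, 3}], [1, 2]))  # (0, 1, 2)
print(three_with_empty_intersection([{1, 2}, {2, 3}], [1, 2]))          # None
```

One pass over the n xor-genotypes, each of size at most m, matches the O(nm) bound stated on the slide.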


Informative SNPs (Bafna et al. 2003):
Input:
1. Haplotypes H = {H_1, …, H_n} on SNPs S = {s_1, …, s_m}
2. A set of interesting SNPs S″ ⊆ S
Output: Minimal set S′ ⊆ S \ S″ that distinguishes the same haplotypes as S″
Complexity:
- Not perfect phylogeny: NP-hard (MINIMUM TEST SET)
- Perfect phylogeny, 1 interesting SNP: O(nm), Bafna et al. 2003
[Figure: haplotypes × SNPs matrix]
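The condition "distinguishes the same haplotypes" can be checked directly, as in the sketch below. It reads the condition as: every pair of haplotypes that differs on some interesting SNP also differs on some SNP of the candidate set. That reading, the names `distinguished_pairs` and `is_informative`, the 0-based SNP indices, and the sample data are all assumptions for illustration.

```python
from itertools import combinations


def distinguished_pairs(haplotypes, snps):
    """Index pairs of haplotypes that differ on at least one of the given SNPs."""
    return {
        (i, j)
        for i, j in combinations(range(len(haplotypes)), 2)
        if any(haplotypes[i][s] != haplotypes[j][s] for s in snps)
    }


def is_informative(haplotypes, candidate, interesting):
    """True if the candidate SNPs separate every pair the interesting SNPs separate."""
    return distinguished_pairs(haplotypes, interesting) <= distinguished_pairs(
        haplotypes, candidate
    )


# Hypothetical haplotypes; SNPs 0 and 1 have identical columns.
haps = ["000", "110", "111"]
print(is_informative(haps, candidate={0}, interesting={1}))  # True
print(is_informative(haps, candidate={2}, interesting={1}))  # False
```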

Informative SNPs: generalization of the previous definition
Input:
1. Haplotypes H = {H_1, …, H_n} on SNPs S = {s_1, …, s_m}
2. A set of interesting SNPs S″ ⊆ S
3. A perfect phylogeny for H
4. A cost function C: S → R+
Output: S′ ⊆ S \ S″ with minimal cost that distinguishes the same haplotypes as S″
[Figure: haplotypes × SNPs matrix]
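For very small inputs the weighted problem can be solved by exhaustive search, which makes the objective explicit. The sketch below is a brute force over all subsets and is exponential in m; it is not the authors' algorithm and it ignores the perfect phylogeny input. The name `min_cost_informative`, the 0-based SNP indices, and the sample costs are hypothetical.

```python
from itertools import combinations


def min_cost_informative(haplotypes, interesting, cost):
    """Cheapest subset of non-interesting SNPs separating every pair of
    haplotypes that the interesting SNPs separate. Returns (cost, subset) or None."""
    m = len(haplotypes[0])

    def separates(snps, a, b):
        return any(a[s] != b[s] for s in snps)

    needed = [
        (a, b) for a, b in combinations(haplotypes, 2) if separates(interesting, a, b)
    ]
    candidates = [s for s in range(m) if s not in interesting]
    best = None
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(separates(subset, a, b) for a, b in needed):
                total = sum(cost[s] for s in subset)
                if best is None or total < best[0]:
                    best = (total, set(subset))
    return best


# Hypothetical data: SNP 1 is interesting; SNPs 0 and 3 copy its column.
haps = ["0000", "1101", "1111"]
costs = {0: 5.0, 2: 1.0, 3: 2.0}
print(min_cost_informative(haps, interesting={1}, cost=costs))  # (2.0, {3})
```

SNP 2 is the cheapest single SNP but fails to separate the first two haplotypes, so the search settles on SNP 3.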

We find an informative SNP set
- of minimal cost
- for any number of interesting SNPs
- in O(m)
by a dynamic programming algorithm that climbs up the perfect phylogeny tree.
We prove that the definition of informative SNPs generalizes to a more practical definition: under the perfect phylogeny model, informative SNPs on genotypes and haplotypes are equivalent.

Summary
Xor-haplotyping:
- Definition
- Resolve haplotypes given xor-data and 3 genotypes in O(nm α(m, n))
- Implementation
- Experimental results
Selection of tag SNPs:
- Generalize to arbitrary cost and many interesting SNPs
- Find optimal informative SNPs set in O(m) time
- Combinatorial observation allows practical uses

Future research
- Relax the strong assumption of perfect phylogeny
- Deal with data errors and missing data
- Obtain empirical results for the theoretical work on informative SNPs
- Preliminary results show that blocks of up to 600 SNPs are distinguishable by ~20 informative SNPs

Theorem: All genotypes are distinct within a block
Proof: Assume to the contrary that two genotypes are equivalent:
[Figure: Haplotype Pair 1 with Genotype 1; Haplotype Pair 2 with Genotype 2]