How Accurate is Pure Parsimony Haplotype Inferencing

Slides:



Advertisements
Similar presentations
Combinatorial Algorithms for Haplotype Inference Pure Parsimony Dan Gusfield.
Advertisements

Sharlee Climer, Alan R. Templeton, and Weixiong Zhang
Efficient Algorithms for Genome-wide TagSNP Selection across Populations via the Linkage Disequilibrium Criterion Authors: Lan Liu, Yonghui Wu, Stefano.
June 2, Combinatorial methods in Bioinformatics: the haplotyping problem Paola Bonizzoni DISCo Università di Milano-Bicocca.
June 2, Combinatorial methods in Bioinformatics: the haplotyping problem Paola Bonizzoni DISCo Università di Milano-Bicocca.
More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.
Computational Challenges in Whole-Genome Association Studies Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
WABI 2005 Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombnation Event Yun S. Song, Yufeng Wu and Dan Gusfield University.
Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs Tamar Barzuza 1 Jacques S. Beckmann 2,3 Ron Shamir 4 Itsik Pe’er 5.
L6: Haplotype phasing. Genotypes and Haplotypes Each individual has two “copies” of each chromosome. Each individual has two “copies” of each chromosome.
Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.
CSB Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes) Yufeng Wu and Dan Gusfield University of California, Davis.
Combinatorial Algorithms for Maximum Likelihood Tag SNP Selection and Haplotype Inference Ion Mandoiu University of Connecticut CS&E Department.
Haplotyping via Perfect Phylogeny: A Direct Approach
Combinatorial Approaches to Haplotype Inference Dan Gusfield CS, UC Davis.
Optimal Tag SNP Selection for Haplotype Reconstruction Jin Jun and Ion Mandoiu Computer Science & Engineering Department University of Connecticut.
Integer Programming for Phylogenetic and Population- Genetic Problems with Complex Data D. Gusfield, Y. Frid, D. Brown Cocoon’07, July 16, 2007.
A dynamic program algorithm for haplotype block partitioning Zhang, et. al. (2002) PNAS. 99, 7335.
Estimating and Reconstructing Recombination in Populations: Problems in Population Genomics Dan Gusfield UC Davis Different parts of this work are joint.
Take a Walk and Cluster Genes: A TSP-based Approach to Optimal Rearrangement Clustering Sharlee Climer and Weixiong Zhang This research was supported in.
Combinatorial Reconstruction of Sibling Relationships in Absence of Parental Data Tanya Y Berger-Wolf (DIMACS and UIC CS) Bhaskar DasGupta (UIC CS) Wanpracha.
Haplotype Blocks An Overview A. Polanski Department of Statistics Rice University.
National Taiwan University Department of Computer Science and Information Engineering Introduction to SNP and Haplotype Analysis Algorithms and Computational.
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
The Complexities of Data Analysis in Human Genetics Marylyn DeRiggi Ritchie, Ph.D. Center for Human Genetics Research Vanderbilt University Nashville,
Biology 101 DNA: elegant simplicity A molecule consisting of two strands that wrap around each other to form a “twisted ladder” shape, with the.
Informative SNP Selection Based on Multiple Linear Regression
Estimating and Reconstructing Recombination in Populations: Problems in Population Genomics Dan Gusfield UC Davis Different parts of this work are joint.
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Recombination based population genomics Jaume Bertranpetit Marta Melé Francesc Calafell Asif Javed Laxmi Parida.
A Linear Search Strategy Using Bounds Sharlee Climer and Weixiong Zhang.
Linear Reduction Method for Tag SNPs Selection Jingwu He Alex Zelikovsky.
California Pacific Medical Center
Clustering and optimization in genetic data: the problem of Tag-SNPs selection Paola Bertolazzi, Serena D‘ Aguanno, Giovanni Felici *, Paola Festa** *
Genotype Calling Matt Schuerman. Biological Problem How do we know an individual’s SNP values (genotype)? Each SNP can have two values (A/B) Each individual.
Bayesian Multi-Population Haplotype Inference via a Hierarchical Dirichlet Process Mixture Duke University Machine Learning Group Presented by Kai Ni August.
Imputation-based local ancestry inference in admixed populations
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
National Taiwan University Department of Computer Science and Information Engineering Introduction to SNP and Haplotype Analysis Algorithms and Computational.
Analyzing Planning Domains For Task Decomposition AHM Modasser Billah( ), Raju Ahmed( ) Department of Computer Science and Engineering (CSE),
Efficient Point Coverage in Wireless Sensor Networks Jie Wang and Ning Zhong Department of Computer Science University of Massachusetts Journal of Combinatorial.
National Taiwan University Department of Computer Science and Information Engineering An Approximation Algorithm for Haplotype Inference by Maximum Parsimony.
International Workshop on Bioinformatics Research and Applications, May 2005 Phasing and Missing data recovery in Family Trios D. Brinza J. He W. Mao A.
The Haplotype Blocks Problems Wu Ling-Yun
Yufeng Wu and Dan Gusfield University of California, Davis
Common variation, GWAS & PLINK
Introduction to SNP and Haplotype Analysis
Algorithms for estimating and reconstructing recombination in populations Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu,
Of Sea Urchins, Birds and Men
Constrained Hidden Markov Models for Population-based Haplotyping
SNP Haplotype Block Partition and tagSNP Finding
How to Solve NP-hard Problems in Linear Time
Introduction to SNP and Haplotype Analysis
Estimating Recombination Rates
Sharlee Climer Department of Computer Science and Engineering
Correlation for a pair of relatives
Haplotype Reconstruction
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Identification of Paralogs in RADseq data
What are BLUP? and why they are useful?
Chapter 11 Limitations of Algorithm Power
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Basic SNP characteristics using different filter settings for missing data. Basic SNP characteristics using different filter settings for missing data.
On solving population haplotype inference problems
Approximation Algorithms for the Selection of Robust Tag SNPs
Approximation Algorithms for the Selection of Robust Tag SNPs
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Parsimony population haplotyping
Srinivas Neginhal Anantharaman Kalyanaraman CprE 585: Survey Project
Presentation transcript:

How Accurate is Pure Parsimony Haplotype Inferencing How Accurate is Pure Parsimony Haplotype Inferencing? Sharlee Climer Department of Computer Science and Engineering Department of Biology Washington University in Saint Louis sharlee@climer.us www.climer.us Joint work with Weixiong Zhang and Gerold Jaeger RECOMB SNPs Workshop/Jan 28, 2007

Pure Parsimony Pure Parsimony Haplotype Inferencing (PPHI) Find smallest set of unique haplotypes that can resolve a set of genotypes Suggested by Earl Hubbell in 2000 Cast as an Integer Linear Program (IP) by Dan Gusfield [CPM’03] Great research interest RECOMB SNPs Workshop/Jan 28, 2007

Overview Biological forces Haplotypes with low frequency Define haplotype classes Data sets Characteristics of real data RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces RECOMB SNPs Workshop/Jan 28, 2007

Biological forces Relatively few unique haplotypes Subset of haplotypes with low frequency Problems for PPHI Large number of optimal solutions True biological solution might not be parsimonious What are structural characteristics of optimal solutions? RECOMB SNPs Workshop/Jan 28, 2007

Classes of haplotypes Set of possible haplotypes is exponentially large Partition similar to Traveling Salesman Problem Backbone haplotypes Appear in every optimal solution Fat haplotypes Do not appear in any optimal solution Fluid haplotypes Appear in some, but not all, optimal solutions RECOMB SNPs Workshop/Jan 28, 2007

Backbone haplotypes Implicit backbones Explicit backbones All haplotypes that resolve unambiguous genotypes Explicit backbones Can identify by solving at most one IP for each haplotype in solution that isn’t implicit backbone RECOMB SNPs Workshop/Jan 28, 2007

Backbone haplotypes h3 h7 h15 h27 h39 h50 h55 h79 h91 bb bb bb bb RECOMB SNPs Workshop/Jan 28, 2007

Backbone graph RECOMB SNPs Workshop/Jan 28, 2007

Backbone graph RECOMB SNPs Workshop/Jan 28, 2007

An optimal solution RECOMB SNPs Workshop/Jan 28, 2007

Low frequency haplotype RECOMB SNPs Workshop/Jan 28, 2007

Low frequency haplotype RECOMB SNPs Workshop/Jan 28, 2007

Low frequency haplotype RECOMB SNPs Workshop/Jan 28, 2007

Low frequency haplotype RECOMB SNPs Workshop/Jan 28, 2007

Data sets 7 true haplotype data sets Orzack et al.[Genetics, 2003] 80 genotypes 9 sites ApoE Andres et al. [Genet. Epi., in press] 6 sets of complete data 39 genotypes 5 to 47 sites KLK13 and KLK14 RECOMB SNPs Workshop/Jan 28, 2007

Data sets HapMap data [Nature 2003, 2005] Phase unknown Random instance generator 20 unique genotypes 20 sites Three populations CEU YRI JPT+CHB 22 chromosomes RECOMB SNPs Workshop/Jan 28, 2007

Size of haplotype backbone RECOMB SNPs Workshop/Jan 28, 2007

Number of fluid haplotypes in each solution RECOMB SNPs Workshop/Jan 28, 2007

Number of optimal solutions RECOMB SNPs Workshop/Jan 28, 2007

Number of fluid haplotypes and solutions RECOMB SNPs Workshop/Jan 28, 2007

Biological correctness Data set # gen. # sites # BB hap. #fluid hap. # opt. sols. Avg. distance to real A 30 9 15 1 8 B 10 5 7 C 18 17 3 16 7.5 D 6 4 2.5 E 23 26 >1000 4.33 F 22 12 630 28.24 G 35 47 10.95 RECOMB SNPs Workshop/Jan 28, 2007

Biological correctness Data set Parsimony # of haplotypes True # of haplotypes A 15 17 B 7 C 12 D E 16 F 18 G 28 32 RECOMB SNPs Workshop/Jan 28, 2007

Biological correctness Accuracy of backbone haplotypes Two data sets (F and G) had errors One parsimony backbone haplotype not in real solution RECOMB SNPs Workshop/Jan 28, 2007

Number of solutions vs. number of genotypes RECOMB SNPs Workshop/Jan 28, 2007

Conclusions Biological forces tend to minimize cardinality, but also create low frequency haplotypes Low frequency in unique genotypes might not be low frequency in full set Low frequency haplotypes Large number of optimal solutions True solution not necessarily parsimonious Combinatorial nature can lead to errors in backbones Parsimony combined with other biological clues RECOMB SNPs Workshop/Jan 28, 2007