June 2, 2015. Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo, Università di Milano-Bicocca.

Content: motivation and biological terms; combinatorial methods in haplotyping; haplotyping via perfect phylogeny (the PPH problem); inference of incomplete perfect phylogeny (algorithms); incomplete PPH and missing data; other models and open problems.

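Biological terms. A diploid organism carries two haplotypes, one maternal (A A A in the figure) and one paternal (G C A); the pair forms the genotype. A site is homozygous when the two copies agree and heterozygous when they differ (figure: sites i, i+1, i+2). Biallelic site i: at most two values from {A, C, G, T} occur at site i.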

Motivations. Human genetic variations are related to diseases (cancers, diabetes, osteoporosis). The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes. The Human Genome Project produces genotype sequences of humans, so computational methods that derive haplotypes from genotype data are in demand. The ongoing international HapMap project aims to find haplotype differences on a large scale (HapMap population data). Combinatorial methods: graphs, set-cover problems, optimization problems.

Haplotyping: the formal model. Haplotype: an m-vector h = <h(1), …, h(m)> over {0,1}. Genotype: an m-sequence g = <g(1), …, g(m)> over {0,1,*}. Def. Haplotypes h, k solve genotype g iff, for each position i: g(i) = * implies h(i) ≠ k(i); h(i) = k(i) = g(i) otherwise.
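The definition above can be checked mechanically. The following is a minimal sketch, assuming haplotypes and genotypes are encoded as Python strings over '0', '1' and '*'; the function name `solves` is a hypothetical helper, not part of the slides.

```python
def solves(h, k, g):
    """Return True iff the haplotype pair (h, k) solves genotype g.

    Assumed encoding (not from the slides): strings over '0', '1',
    with '*' marking a heterozygous site in g.
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == '*':
            # heterozygous site: the two haplotypes must differ
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # homozygous site: both haplotypes must equal the genotype value
            return False
    return True
```

For instance, `solves("01101", "01011", "01**1")` holds because the two haplotypes differ exactly at the two starred sites.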

Examples (figure): a genotype g solved by a pair of haplotypes h and k; the Clark inference rule applied to genotypes g1, g2, g3 and haplotypes h1, h2, h3.

Haplotype inference: the general problem. Problem HI. Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes. Solution: a set H' of haplotypes that solves each genotype g in G, s.t. H ⊆ H'. H' derives from an inference RULE.

Types of inference rules. Clark's rule: haplotypes solve g by an iterative rule. Gusfield coalescent model: haplotypes are related to genotypes by a tree model. Pedigree data: haplotypes are related to genotypes by a directed graph.
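The iterative character of Clark's rule can be sketched as follows. This is an illustrative reading of the rule, assuming the string encoding over '0', '1', '*' and a given starting set of known haplotypes (the set H of problem HI); the names `clark_complement` and `clark_inference` are hypothetical, and the scan order is an arbitrary choice, so a different order may resolve a different set of genotypes.

```python
def clark_complement(h, g):
    """If known haplotype h is compatible with genotype g, return the
    haplotype k such that (h, k) solves g; otherwise return None."""
    k = []
    for hi, gi in zip(h, g):
        if gi == '*':
            k.append('1' if hi == '0' else '0')  # heterozygous: take the other allele
        elif hi == gi:
            k.append(gi)                          # homozygous: copy the shared allele
        else:
            return None                           # h disagrees with g at a fixed site
    return ''.join(k)


def clark_inference(genotypes, known):
    """Repeatedly resolve a genotype with an already known haplotype,
    adding the inferred complement to the pool, until nothing changes."""
    haplotypes = set(known)
    unresolved = list(genotypes)
    progress = True
    while progress and unresolved:
        progress = False
        for g in list(unresolved):
            for h in sorted(haplotypes):  # sorted only to make the run deterministic
                k = clark_complement(h, g)
                if k is not None:
                    haplotypes.add(k)
                    unresolved.remove(g)
                    progress = True
                    break
    return haplotypes, unresolved
```

The function returns both the enlarged haplotype set and the genotypes that no known haplotype could resolve, since the rule may leave some genotypes unexplained.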

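Mendelian law and recombination (figure): a father with haplotypes A, B and a mother with haplotypes C, D; the children C1, C2, C3, C4 carry the pairs AC, AD, BC, BD. With recombination, a parent with haplotypes AC and BD can pass AD or BC to the child.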

Pedigree: pedigree, nuclear family, founder.

Pedigree: pedigree, nuclear family, founder (figure labels: father, mother, children, ID number, genotypes, founders, nuclear family, family trio, loop, mating node).

Haplotyping from genotypes: the problem and methods.
Problem. Input: genotype data (possibly with missing entries). Output: haplotypes.
Input data: data with pedigree (dependent individuals) or data without pedigree information (independent individuals).
Statistical methods: find the most likely haplotypes based on the genotype data. Advantage: solid theoretical basis. Disadvantage: computationally intensive.
Rule-based methods: define rules based on some plausible assumptions and find the haplotypes consistent with these rules. Advantage: usually simple and thus very fast. Disadvantage: no numerical assessment of the reliability of the results.

HI by the perfect phylogeny model. IDEA: genotypes are the mating of haplotypes in a tree. Given G, find H and T that explain G!
G: g1 = 0,1,*,*,1 and g2 = *,0,0,0,1.
H: 0,1,1,0,1 and 0,1,0,1,1 (solving g1); 1,0,0,0,1 and 0,0,0,0,1 (solving g2).
Root of the tree: 00000.

Perfect phylogeny models. Input data: a 0-1 matrix A (columns are characters c1, …, c5; rows are species s1, …, s4). Output data: a phylogeny for A (figure: tree with root R, leaves s1, s2, s3, s4 and edges labelled c3, c4, c2 and c1, c5; the path c3 c4 is highlighted).

Perfect phylogeny. Def. A pp T for a 0-1 matrix A: each row s_i labels exactly one leaf of T; each column c_j labels exactly one edge of T; each internal edge is labelled by at least one column c_j; row s_i gives the 0-1 path from the root to s_i (figure: the tree with leaves s1, s2, s3, s4 and the path c3 c4).

pp model: another view. L(x), the cluster of x: the set of leaves of the subtree T_x rooted at x. A pp is associated with a tree-family (S, C), where S = {s1, …, sn} and C = {S' ⊆ S : S' is a cluster}, s.t. ∀ X, Y ∈ C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.

pp: another view. A tree-family (S, C) is represented by a 0-1 matrix B with at least one column for each set in C: column c_i represents the set S', where s_j ∈ S' iff b_ji = 1. Lemma: a 0-1 matrix has a pp iff it represents a tree-family.
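The lemma suggests a direct, if not the fastest, test: collect the set of rows holding a 1 in each column and check that any two such sets are either disjoint or nested. The sketch below assumes rows are given as strings over '0' and '1'; the names `column_clusters` and `represents_tree_family` are hypothetical. It compares all column pairs, so it runs in O(nm²) time rather than the O(nm) of the dedicated algorithm.

```python
from itertools import combinations


def column_clusters(matrix):
    """For each column, the set of row indices that hold a '1' in it."""
    m = len(matrix[0]) if matrix else 0
    return [frozenset(i for i, row in enumerate(matrix) if row[j] == '1')
            for j in range(m)]


def represents_tree_family(matrix):
    """True iff every two column clusters are disjoint or nested."""
    for x, y in combinations(column_clusters(matrix), 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True
```

The four haplotypes 01101, 01011, 10001, 00001 pass the test, while the three rows 11, 10, 01 fail it because the two column clusters overlap without being nested.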

Haplotyping by the pp. A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes: rows s_i are haplotypes, columns c_i are SNP sites. The 0-1 switch in position i occurs only once in the tree!

Haplotyping and the pp: observations. The root of T may not be the all-zero haplotype, so a site may undergo a 0-1 switch or a 1-0 switch. In the directed case only 0-1 switches occur.

HI problem in the pp model (DECISION problem). Input data: a 0-1-* matrix B (n × m) of genotypes G. Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t. (1) each g ∈ G is solved by a pair of rows in B'; (2) B' has a pp (tree family). (Figure: a genotype matrix whose haplotype rows are still unknown.)
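To make the decision problem concrete, here is a brute-force sketch that tries every way of resolving the starred sites and keeps the first haplotype matrix that admits a perfect phylogeny. Two assumptions are not from the slides: the strings-over-'0','1','*' encoding, and the use of the standard four-gamete condition (no two columns show all of 00, 01, 10, 11) as the test for a pp with unknown root. The search is exponential in the number of heterozygous sites and is meant only to illustrate the problem statement, not any of the published PPH algorithms.

```python
from itertools import combinations, product


def four_gamete_ok(rows):
    """True iff no pair of columns contains all four gametes 00, 01, 10, 11."""
    m = len(rows[0]) if rows else 0
    for a, b in combinations(range(m), 2):
        if len({(r[a], r[b]) for r in rows}) == 4:
            return False
    return True


def resolutions(g):
    """Yield every unordered haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, c in enumerate(g) if c == '*']
    if not stars:
        yield g, g
        return
    # h is fixed to '0' at the first starred site to skip mirrored pairs
    for bits in product('01', repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for i, b in zip(stars, ('0',) + bits):
            h[i] = b
            k[i] = '1' if b == '0' else '0'
        yield ''.join(h), ''.join(k)


def pph_brute_force(genotypes):
    """Return 2n haplotype rows solving the genotypes and admitting a pp,
    or None if no resolution works."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        rows = [x for pair in choice for x in pair]
        if four_gamete_ok(rows):
            return rows
    return None
```

Rows 2i and 2i+1 of the returned list are the pair chosen for the i-th genotype.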

An example.
Genotypes: a = * *, b = 0 *, c = 1 0.
Haplotypes: a = 1 0, a' = 0 1; b = 0 1, b' = 0 0; c = 1 0, c' = 1 0.
(Figure: the tree with leaves a, c, c', b', a', b.)

The PPH problem: solutions. An undirected algorithm (Gusfield, Recomb 2002). An O(nm²) algorithm (Karp et al., Recomb 2003). A linear-time O(nm) algorithm?? (optimal algorithm). A related problem: the incomplete directed pp (IDP), inferring a pp from a 0-1-* matrix; O(nm + k log²(n + m)) algorithm (I. Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004).

IDP problem. Instance: a 0-1-? matrix A. Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists". OPEN PROBLEM: find an optimal algorithm?? (Figure: a 0-1-? matrix over columns C1, …, C5 and rows S1, S2, S3.)
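The IDP statement can be illustrated by exhaustive search: try each way of filling the ? entries and accept the first completed matrix whose column clusters are pairwise disjoint or nested (the directed case, with the all-zero root). This is a sketch under assumptions, namely rows encoded as strings over '0', '1', '?' and the hypothetical names `is_laminar` and `resolve_idp`; it is exponential in the number of ? entries and is not the near-linear algorithm cited on the slides.

```python
from itertools import combinations, product


def is_laminar(matrix):
    """True iff the 1-sets of every two columns are disjoint or nested."""
    m = len(matrix[0]) if matrix else 0
    cols = [frozenset(i for i, row in enumerate(matrix) if row[j] == '1')
            for j in range(m)]
    return all(not (x & y) or x <= y or y <= x
               for x, y in combinations(cols, 2))


def resolve_idp(matrix):
    """Return a completion A' of the 0-1-? matrix that has a directed pp,
    or None if no completion does."""
    holes = [(i, j) for i, row in enumerate(matrix)
             for j, c in enumerate(row) if c == '?']
    for bits in product('01', repeat=len(holes)):
        filled = [list(row) for row in matrix]
        for (i, j), b in zip(holes, bits):
            filled[i][j] = b
        if is_laminar(filled):
            return [''.join(row) for row in filled]
    return None
```

A return value of None corresponds to the answer "no pp exists".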

Decision algorithms for incomplete pp. Based on a characterization of the 0-1 matrices A that have a pp: tree family; forbidden submatrix (gives a "no" certificate). Bipartite graph G(A) = (S, C, E) with E = {(s_i, c_j) : b_ij = 1}. Forbidden subgraph (figure: columns c, c' and rows s1, s2, s3).

Test: does a 0-1 matrix A have a pp? O(nm) algorithm (Gusfield 1991).
Steps:
1. Given A, order the columns {c1, …, cm} as decreasing binary numbers, obtaining A'.
2. For each (i, j) with A'[i,j] = 1, let L(i,j) = max{l < j : A'[i,l] = 1}, or 0 if there is no such l.
3. Let index(j) = max{L(i,j) : A'[i,j] = 1}.
4. Then apply the theorem.
TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1.
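The steps above can be sketched as follows. Assumptions in this sketch: rows are strings over '0' and '1', the first row is taken as the most significant bit when columns are read as binary numbers, and index(j) ranges only over rows with a 1 in column j. The function name `has_directed_pp` is hypothetical. Python's comparison sort is used for brevity, which costs O(nm log m); reaching O(nm) overall would need a radix sort of the columns.

```python
def has_directed_pp(matrix):
    """Sorted-column test for a perfect phylogeny of a 0-1 matrix."""
    n = len(matrix)
    m = len(matrix[0]) if n else 0
    # Step 1: columns as tuples, sorted as decreasing binary numbers
    cols = sorted((tuple(row[j] for row in matrix) for j in range(m)),
                  reverse=True)
    # last_one[i] holds L(i, j): the latest earlier column with a 1 in row i
    last_one = [0] * n
    for j, col in enumerate(cols, start=1):
        rows_with_one = [i for i in range(n) if col[i] == '1']
        # Steps 2-4: all L(i, j) over rows with a 1 must coincide with index(j)
        if len({last_one[i] for i in rows_with_one}) > 1:
            return False
        for i in rows_with_one:
            last_one[i] = j
    return True
```

Checking that all values L(i,j) in a column coincide is the same as checking L(i,j) = index(j), since index(j) is their maximum.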

Idea (figure).

The IDP algorithm (figure: subgraph on columns c, c' and rows s1, s2, s3).

Other HI problems via the pp model. Incomplete 0-1-*-? matrix because of missing data: incomplete haplotype pp (Ihpp), with haplotype rows; incomplete genotype pp (Igpp), with genotype rows. Algorithms: Ihpp = IDP given a row as a root (polynomial time), NP-complete otherwise. Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, ICALP 2004), NP-complete otherwise.

HI problem and other models. Haplotype inference in pedigree data under the recombination model (figure: maternal and paternal haplotypes, a recombination, and the child).

Pedigree graph. Single-mating pedigree tree, mating loop, nuclear family, pedigree graph (figure labels: father, mother, child).

Haplotype inference in pedigree (figure: a pedigree whose members carry phased genotypes written as paternal|maternal allele pairs, such as 0|1 and 1|0).

Problems. MPT-MRHI (pedigree tree, multi-mating, minimum recombination HI). SPT-MRHI (pedigree tree, single-mating, minimum recombination HI). OPEN. NP-complete even if the graph is acyclic, but with an unbounded number of children…

Conclusions

References