June 2, 2015. Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo, Università di Milano-Bicocca.

Presentation transcript:

Combinatorial methods in Bioinformatics: the haplotyping problem
Paola Bonizzoni, DISCo, Università di Milano-Bicocca

Content
Motivation: biological terms
Combinatorial methods in haplotyping
Haplotyping via perfect phylogeny: the PPH problem
Inference of incomplete perfect phylogeny: algorithms
Incomplete PPH and missing data
Other models: open problems

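Biological terms
Diploid organism: each site carries a maternal and a paternal copy. Example haplotypes over sites i, i+1, i+2: maternal A A A, paternal G C A.
Genotype: the pair of values at each site; a site is homozygous when the two copies are equal and heterozygous when they differ.
Biallelic site i: |Value(i)| ≤ 2, where Value(i) ⊆ {A,C,G,T}.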

Motivations
Human genetic variations are related to diseases (cancers, diabetes, osteoporosis).
The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes.
The human genome project produces genotype sequences of humans.
Computational methods to derive haplotypes from genotype data are in demand.
Ongoing international HapMap project: find haplotype differences on a large scale (HapMap population data).
Combinatorial methods: graphs, set-cover problems, optimization problems.

Haplotyping: the formal model
Haplotype: m-vector h = <h(1), …, h(m)> over {0,1}.
Genotype: m-sequence g = <g(1), …, g(m)> over {0,1,*}.
Def. Haplotypes <h, k> solve genotype g iff: g(i) = * implies h(i) ≠ k(i), and h(i) = k(i) = g(i) otherwise.
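
The definition above can be checked mechanically. The following is a minimal sketch, assuming haplotypes and genotypes are encoded as Python strings over the characters 0, 1 and *; the function name `solves` is an illustrative choice, not something taken from the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair (h, k) solves genotype g.

    Assumed encoding: h and k are strings over "01", g is a string over "01*".
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # heterozygous site: the two haplotypes must differ
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # homozygous site: both haplotypes must carry the genotype value
            return False
    return True


print(solves("01101", "01011", "01**1"))  # True
print(solves("01101", "01101", "01**1"))  # False: sites 3 and 4 must differ
```

The check is linear in the number of sites m, so verifying a proposed solution is cheap; the difficulty of haplotyping lies in choosing among the many pairs that pass this check.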

Examples
(Figure: a genotype g solved by a pair of haplotypes h and k; the Clark inference rule applied to genotypes g_1, g_2, g_3, yielding haplotypes h_1, h_2, h_3.)

Haplotype inference: the general problem
Problem HI:
Instance: a set G = {g_1, …, g_m} of genotypes and a set H = {h_1, …, h_n} of haplotypes.
Solution: a set H' of haplotypes that solves each genotype g in G, s.t. H ⊆ H'.
H' derives from an inference RULE.

Types of inference rules
Clark's rule: haplotypes solve g by an iterative rule.
Gusfield coalescent model: haplotypes are related to genotypes by a tree model.
Pedigree data: haplotypes are related to genotypes by a directed graph.
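
Of the three rule types, Clark's iterative rule is the easiest to sketch in code. The version below is a simplified greedy reading of it, under stated assumptions: a seed set of known haplotypes is given, genotypes are strings over 0, 1 and *, and the names `complement` and `clark` are hypothetical. The outcome depends on the order in which genotypes and haplotypes are tried, and some genotypes may stay unresolved.

```python
def complement(h, g):
    """Return the haplotype k such that (h, k) solves g, or None if h cannot take part."""
    k = []
    for hi, gi in zip(h, g):
        if gi == "*":
            k.append("1" if hi == "0" else "0")  # heterozygous site: flip
        elif hi == gi:
            k.append(hi)                          # homozygous site: copy
        else:
            return None                           # h disagrees with g
    return "".join(k)


def clark(genotypes, known):
    """Greedily resolve genotypes with known haplotypes, adding each inferred mate."""
    haplotypes = list(known)
    unresolved = list(genotypes)
    progress = True
    while unresolved and progress:
        progress = False
        for g in list(unresolved):
            for h in haplotypes:
                k = complement(h, g)
                if k is not None:
                    if k not in haplotypes:
                        haplotypes.append(k)
                    unresolved.remove(g)
                    progress = True
                    break
    return haplotypes, unresolved


haps, left = clark(["0*0*", "**0*", "1*11"], ["0000"])
print(haps)  # ['0000', '0101', '1101']
print(left)  # ['1*11']
```

In this run the third genotype is left unresolved because no haplotype collected so far is compatible with it, which illustrates why the rule alone does not guarantee a solution for every genotype.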

HI by the perfect phylogeny model
Idea: genotypes are the mating of haplotypes in a tree. Given G, find H and T that explain G!
G: g1 = 0,1,*,*,1 and g2 = *,0,0,0,1
H: 0,1,1,0,1 and 0,1,0,1,1 (for g1); 1,0,0,0,1 and 0,0,0,0,1 (for g2)
(Figure: tree T over the haplotypes in H, with a node labelled 00000.)

Perfect Phylogeny models
Input data: 0-1 matrix A (characters, species)
Output data: phylogeny for A
(Figure: matrix A with rows s_1, …, s_4 and columns c_1, …, c_5, and its phylogeny rooted at R with edges labelled c_1, c_5, c_2, c_3, c_4 and leaves s_1, …, s_4; the path c_3 c_4 is marked.)

Perfect phylogeny
Def. A pp T for a 0-1 matrix A:
each row s_i labels exactly one leaf of T
each column c_j labels exactly one edge of T
each internal edge is labelled by at least one column c_j
row s_i gives the 0,1 path from the root to s_i
(Figure: the phylogeny with leaves s_1, …, s_4 and edges c_1, …, c_5; the path c_3 c_4 is marked.)

pp model: another view
L(x), the cluster of x: the set of leaves of T_x, the subtree of T rooted at x.
A pp is associated to a tree-family (S,C) with S = {s_1, …, s_n} and C = {S' ⊆ S : S' is a cluster}, s.t. for all X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.

pp: another view
A tree-family (S,C) is represented by a 0-1 matrix B: each column c_i corresponds to a set S' in C, with s_j ∈ S' iff b_ji = 1; for each set in C there is at least one column.
Lemma. A 0-1 matrix has a pp iff it represents a tree-family.
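
The lemma suggests a direct, if quadratic, test: read each column as the set of rows holding a 1 and check that every two such sets are disjoint or nested. The sketch below assumes the matrix is a list of rows of integers 0/1 and that the root is the all-zero vector; the names `clusters` and `is_tree_family` are illustrative.

```python
from itertools import combinations


def clusters(matrix):
    """Return, for each column, the set of row indices that hold a 1."""
    n = len(matrix)
    m = len(matrix[0]) if matrix else 0
    return [frozenset(j for j in range(n) if matrix[j][i] == 1) for i in range(m)]


def is_tree_family(matrix):
    """True iff every two column sets are disjoint or one contains the other."""
    for x, y in combinations(clusters(matrix), 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


haplotypes = [
    [0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 1],
]
print(is_tree_family(haplotypes))                  # True
print(is_tree_family([[1, 0], [1, 1], [0, 1]]))    # False
```

This pairwise test costs O(nm^2) set operations in the worst case, which is fine for illustration but slower than a linear-time test on the same matrix.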

Haplotyping by the pp
A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes: each row s_i is a haplotype and each column c_i is a SNP site.
The 0-1 switch in position i occurs only once in the tree!

Haplotyping and the pp: observations
The root of T may not be the all-zero haplotype: a site may then show a 0-1 switch or a 1-0 switch.
Directed case: only the 0-1 switch occurs.

HI problem in the pp model
Input data: an n × m 0-1-* matrix B of genotypes G
Output data: a 2n × m 0-1 matrix B' of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows in B'
(2) B' has a pp (tree family)
DECISION problem: does such a B' exist?

An example
Genotypes: a = * *, b = 0 *, c = 1 0
Haplotypes: a = 1 0, a' = 0 1, b = 0 1, b' = 0 0, c = 1 0, c' = 1 0
(Figure: the perfect phylogeny with leaves a, c, c', b', a', b.)
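
For an instance this small, the decision problem can be settled by exhaustive search. The sketch below is not one of the published PPH algorithms; it simply tries every way of resolving the * sites and keeps the first haplotype matrix whose columns form a tree family. It assumes the directed case (all-zero root), string-encoded genotypes, and hypothetical names `expansions`, `has_pp` and `pph_brute_force`. Its running time is exponential in the number of * entries.

```python
from itertools import combinations, product


def expansions(g):
    """Yield every haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, x in enumerate(g) if x == "*"]
    for bits in product("01", repeat=len(stars)):
        h, k = list(g), list(g)
        for i, b in zip(stars, bits):
            h[i] = b
            k[i] = "1" if b == "0" else "0"
        yield "".join(h), "".join(k)


def has_pp(rows):
    """Directed pp test: column sets must be pairwise disjoint or nested."""
    m = len(rows[0]) if rows else 0
    cols = [frozenset(r for r, row in enumerate(rows) if row[c] == "1")
            for c in range(m)]
    return all(not (x & y) or x <= y or y <= x
               for x, y in combinations(cols, 2))


def pph_brute_force(genotypes):
    """Return 2n haplotype rows admitting a pp, or None if none exists."""
    for choice in product(*(list(expansions(g)) for g in genotypes)):
        rows = [h for pair in choice for h in pair]
        if has_pp(rows):
            return rows
    return None


print(pph_brute_force(["**", "0*", "10"]))
# ['01', '10', '00', '01', '10', '10']
print(pph_brute_force(["11", "10", "01"]))  # None
```

The first result agrees with the example above up to the order of the two haplotypes within each pair. The second instance has no solution only because the root is assumed to be all-zero.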

The PPH problem: solutions
An undirected algorithm: Gusfield, Recomb 2002
An O(nm^2) algorithm: Karp et al., Recomb 2003
A linear-time O(nm) algorithm ?? (optimal algorithm)
A related problem: the incomplete directed pp (IDP), inferring a pp from a 0-1-* matrix
O(nm + k log^2(n+m)) algorithm: I. Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004

IDP problem
Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists"
OPEN PROBLEM: find an optimal algorithm ??
(Figure: an example 0-1-? matrix with rows S_1, S_2, S_3 and columns C_1, …, C_5.)
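
The problem statement can be made concrete with an exhaustive search over all ways of filling the ? entries. This is only a sketch for tiny matrices, not the algorithm of Pe'er et al.; it assumes the directed case, rows encoded as strings over 0, 1 and ?, and hypothetical function names.

```python
from itertools import combinations, product


def has_pp(rows):
    """Directed pp test: column sets must be pairwise disjoint or nested."""
    m = len(rows[0]) if rows else 0
    cols = [frozenset(r for r, row in enumerate(rows) if row[c] == "1")
            for c in range(m)]
    return all(not (x & y) or x <= y or y <= x
               for x, y in combinations(cols, 2))


def idp_brute_force(matrix):
    """Return a completion of the ? entries that has a pp, or None."""
    holes = [(i, j) for i, row in enumerate(matrix)
             for j, x in enumerate(row) if x == "?"]
    for bits in product("01", repeat=len(holes)):
        filled = [list(row) for row in matrix]
        for (i, j), b in zip(holes, bits):
            filled[i][j] = b
        rows = ["".join(r) for r in filled]
        if has_pp(rows):
            return rows
    return None  # corresponds to the answer "no pp exists"


print(idp_brute_force(["1?", "11", "?1"]))        # ['10', '11', '11']
print(idp_brute_force(["10", "11", "01", "??"]))  # None
```

The search tries 2^k completions for k unknown entries, which is why a polynomial algorithm for the problem is of interest.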

Decision algorithms for incomplete pp
Based on: characterization of a 0-1 matrix A that has a pp
- tree family
- forbidden submatrix: gives a "no" certificate
Bipartite graph G(A) = (S,C,E) with E = {(s_i, c_j) : b_ij = 1}
Forbidden subgraph (figure: column nodes c, c' and row nodes s_1, s_2, s_3)
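
A "no" certificate can be extracted directly from G(A). The sketch below assumes that the forbidden subgraph is the induced path s1 - c - s2 - c' - s3, that is, s1 adjacent only to c, s2 adjacent to both columns, and s3 adjacent only to c'; this reading of the figure is an assumption, although it matches the standard forbidden submatrix with rows 10, 11, 01. The function name and the 0-based indices are illustrative.

```python
from itertools import combinations


def forbidden_subgraph(matrix):
    """Return (s1, s2, s3, c, c2) witnessing that no directed pp exists, else None."""
    n = len(matrix)
    m = len(matrix[0]) if matrix else 0
    # adjacency of each column node in G(A): the rows i with b_ij = 1
    adj = [{i for i in range(n) if matrix[i][j] == 1} for j in range(m)]
    for c, c2 in combinations(range(m), 2):
        only_c = adj[c] - adj[c2]    # candidate s1
        both = adj[c] & adj[c2]      # candidate s2
        only_c2 = adj[c2] - adj[c]   # candidate s3
        if only_c and both and only_c2:
            return (min(only_c), min(both), min(only_c2), c, c2)
    return None


print(forbidden_subgraph([[1, 0], [1, 1], [0, 1]]))  # (0, 1, 2, 0, 1)
print(forbidden_subgraph([[1, 1], [1, 0], [0, 0]]))  # None
```

Returning the witness rows and columns, rather than a plain yes/no, lets a caller report which part of the data conflicts with the tree model.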

Test: does a 0-1 matrix A have a pp? O(nm) algorithm (Gusfield 1991)
Steps:
1. Given A, order the columns {c_1, …, c_m} as decreasing binary numbers, obtaining A'.
2. For each (i,j) with A'[i,j] = 1, let L(i,j) = max{l < j : A'[i,l] = 1}, or 0 if no such l exists.
3. Let index(j) = max{L(i,j) : A'[i,j] = 1}.
4. Then apply the theorem.
Th. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1.
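
The steps above can be sketched as follows. The matrix is assumed to be a list of rows of integers 0/1 with the first row as the most significant bit of each column. Python's built-in sort is used in place of a radix sort, so this sketch runs in O(nm log m) rather than O(nm); the function name is illustrative.

```python
def has_perfect_phylogeny(matrix):
    """Column-sorting test for a directed perfect phylogeny of a 0-1 matrix."""
    n = len(matrix)
    m = len(matrix[0]) if matrix else 0
    # Step 1: columns as tuples; lexicographic order on tuples equals
    # binary-number order with row 0 as the most significant bit.
    columns = sorted((tuple(row[j] for row in matrix) for j in range(m)),
                     reverse=True)
    # last_one[i] holds L(i, j): the 1-based index of the latest earlier
    # column with a 1 in row i, or 0 if there is none.
    last_one = [0] * n
    for j, col in enumerate(columns, start=1):
        rows = [i for i in range(n) if col[i] == 1]
        # Steps 2-4: all L(i, j) over the 1-cells of column j must coincide,
        # which is the same as each being equal to their maximum index(j).
        if len({last_one[i] for i in rows}) > 1:
            return False
        for i in rows:
            last_one[i] = j
    return True


print(has_perfect_phylogeny([[0, 1, 1, 0, 1],
                             [0, 1, 0, 1, 1],
                             [1, 0, 0, 0, 1],
                             [0, 0, 0, 0, 1]]))          # True
print(has_perfect_phylogeny([[1, 0], [1, 1], [0, 1]]))  # False
```

Keeping one running value per row avoids storing the full table L(i,j), so the extra memory is O(n) beyond the sorted columns.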

Idea:

The IDP algorithm
(Figure: graph on column nodes c, c' and row nodes s_1, s_2, s_3.)

Other HI problems via the pp model
Incomplete 0-1-*-? matrix because of missing data:
haplotype pp (Ihpp): haplotype rows
genotype pp (Igpp): genotype rows
Algorithms:
Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, ICALP 2004); NP-complete otherwise

HI problem and other models
Haplotype inference in pedigree data under the recombination model
(Figure: maternal and paternal haplotypes, with a recombination producing the child haplotype.)

Pedigree graph
Single mating, pedigree tree, mating loop, nuclear family, pedigree graph
(Figure: nuclear family with father, mother and child.)

Haplotype inference in pedigree
(Figure: pedigree in which each member's genotype is written site by site as paternal|maternal allele pairs, e.g. 0|1.)

Problems:
MPT-MRHI (pedigree tree, multi-mating, minimum-recombination HI)
SPT-MRHI (pedigree tree, single-mating, minimum-recombination HI)
OPEN
NP-complete even if the graph is acyclic, but unbounded number of children…

Conclusions

References