Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo, Università di Milano-Bicocca. June 2, 2015.

1 Combinatorial methods in Bioinformatics: the haplotyping problem
Paola Bonizzoni, DISCo, Università di Milano-Bicocca

2 Content
Motivation: biological terms
Combinatorial methods in haplotyping
Haplotyping via perfect phylogeny: the PPH problem
Inference of incomplete perfect phylogeny: algorithms
Incomplete pph and missing data
Other models: open problems

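3 Biological terms
Diploid organism: two haplotypes, e.g. maternal A A A and paternal G C A
Genotype: the pair of alleles at each of the sites i, i+1, i+2; a site is homozygous when the two alleles are equal and heterozygous when they differ
Biallelic site i: Value(i) ⊆ {A, C, G, T} and |Value(i)| ≤ 2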
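The homozygous/heterozygous distinction can be illustrated with a minimal sketch. The function name `classify_sites` and the string encoding of haplotypes are assumptions made for this example only; the two sequences are the maternal and paternal haplotypes shown on the slide.

```python
def classify_sites(maternal, paternal):
    """Label each site as homozygous (equal alleles) or heterozygous (different alleles)."""
    if len(maternal) != len(paternal):
        raise ValueError("both haplotypes must cover the same sites")
    return ["homozygous" if a == b else "heterozygous"
            for a, b in zip(maternal, paternal)]


# Maternal A A A and paternal G C A, as on the slide.
print(classify_sites("AAA", "GCA"))
# ['heterozygous', 'heterozygous', 'homozygous']
```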

4 Motivations
Human genetic variations are related to diseases (cancers, diabetes, osteoporosis)
The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes
The human genome project produces genotype sequences of humans
Computational methods to derive haplotypes from genotype data are in demand
Ongoing international HapMap project: find haplotype differences on a large scale (HapMap population data)
Combinatorial methods: graphs, set-cover problems, optimization problems

5 Haplotyping: the formal model
Haplotype: an m-vector h over {0,1}
Genotype: an m-sequence g over {0,1,*}
Def. Haplotypes h and k solve genotype g iff, for each position i: g(i) = * implies h(i) ≠ k(i), and h(i) = k(i) = g(i) otherwise
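The definition of "solve" translates almost directly into code. The following is a minimal sketch, assuming haplotypes are strings over "01" and genotypes are strings over "01*"; the names `solves` and `genotype_of` are hypothetical helpers introduced here, not part of the slides.

```python
def solves(h, k, g):
    """Return True if haplotypes h and k solve genotype g.

    g(i) == '*' requires h(i) != k(i); otherwise h(i) == k(i) == g(i).
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            return False
    return True


def genotype_of(h, k):
    """Build the genotype obtained by mating haplotypes h and k."""
    return "".join(a if a == b else "*" for a, b in zip(h, k))


print(solves("01101", "01011", "01**1"))  # True
print(genotype_of("10001", "00001"))      # *0001
```

By construction, `solves(h, k, genotype_of(h, k))` holds for any two haplotypes of equal length.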

6 Examples
Figure: a genotype g solved by a pair of haplotypes h and k, followed by an illustration of the Clark inference rule on genotypes g1, g2, g3 and haplotypes h1, h2, h3 (the vectors themselves are missing from the transcript)

7 Haplotype inference: the general problem
Problem HI
Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes
Solution: a set H' of haplotypes that solves each genotype g in G, s.t. H ⊆ H'
H' derives from an inference RULE

8 Types of inference rules
Clark's rule: haplotypes solve g by an iterative rule
Gusfield coalescent model: haplotypes are related to genotypes by a tree model
Pedigree data: haplotypes are related to genotypes by a directed graph
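As an illustration of the iterative flavour of Clark's rule, here is a minimal sketch under the usual reading of the rule: whenever a known haplotype h is compatible with an unresolved genotype g, the complementary haplotype k is inferred and added to the known set. The function names, the string encoding and the example data are assumptions made for this sketch; the result of Clark's rule generally depends on the order in which genotypes and haplotypes are tried.

```python
def complement(g, h):
    """Return k such that h and k solve g, or None if h is incompatible with g."""
    k = []
    for gi, hi in zip(g, h):
        if gi == "*":
            k.append("1" if hi == "0" else "0")  # heterozygous site: k takes the other value
        elif gi == hi:
            k.append(gi)                          # homozygous site: both copies agree with g
        else:
            return None
    return "".join(k)


def clark(genotypes, known):
    """Repeatedly resolve genotypes with already known haplotypes."""
    known = list(known)
    resolved = {}
    progress = True
    while progress:
        progress = False
        for g in genotypes:
            if g in resolved:
                continue
            for h in known:
                k = complement(g, h)
                if k is not None:
                    resolved[g] = (h, k)
                    if k not in known:
                        known.append(k)
                    progress = True
                    break
    return resolved, known


# "*11*" can only be resolved after "0*1*" has produced the haplotype 0111.
resolved, known = clark(["*11*", "0*1*"], ["0010"])
print(resolved)
print(known)
```

Genotypes that never match a known haplotype are simply left out of `resolved`, which mirrors the fact that the rule may stop before every genotype is explained.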

9 HI by the perfect phylogeny model
IDEA: genotypes are the mating of haplotypes in a tree. Given G, find H and T that explain G!
G: g1 = 0,1,*,*,1 and g2 = *,0,0,0,1
H: 0,1,1,0,1 and 0,1,0,1,1 (solving g1); 1,0,0,0,1 and 0,0,0,0,1 (solving g2)
Tree root: 00000

10 Perfect Phylogeny models
Input data: a 0-1 matrix A of characters (columns) and species (rows)
Output data: a phylogeny for A
Example matrix, rows s1 to s4: 11000, 00100, 11001, 00110
Example tree: root R, edges labelled c3, c4, c2 and c1, c5, leaves s4, s2, s1, s3; the path c3 c4 is marked

11 Perfect phylogeny
Def. A pp T for a 0-1 matrix A:
each row si labels exactly one leaf of T
each column cj labels exactly one edge of T
each internal edge is labelled by at least one column cj
row si gives the 0,1 path from the root to si
Figure: tree with leaves s4, s2, s1, s3 and the root-to-leaf path c3 c4 marked

12 pp model: another view
L(x), the cluster of x: the set of leaves of the subtree Tx
A pp is associated to a tree-family (S, C) with S = {s1, …, sn} and C = {S' ⊆ S: S' is a cluster}, s.t. ∀ X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X

13 pp: another view
A tree-family (S, C) is represented by a 0-1 matrix B: column ci corresponds to a set S' in C, with sj ∈ S' iff bji = 1; for each set in C there is at least one column
Example matrix (rows): 01000, 00100, 11001, 00110
Lemma. A 0-1 matrix has a pp iff it represents a tree-family
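The lemma suggests a direct (quadratic in the number of columns) test: read each column as the set of rows holding a 1 and check that any two such sets are either disjoint or nested. The sketch below assumes rows are given as strings over "01"; `column_clusters` and `is_tree_family` are hypothetical names chosen for this example.

```python
from itertools import combinations


def column_clusters(matrix):
    """For each column, collect the set of row indices that hold a 1."""
    m = len(matrix[0])
    return [frozenset(i for i, row in enumerate(matrix) if row[j] == "1")
            for j in range(m)]


def is_tree_family(clusters):
    """True if every two clusters are disjoint or one contains the other."""
    for x, y in combinations(clusters, 2):
        if x & y and not (x <= y or y <= x):
            return False
    return True


# The example matrix of this slide.
slide_matrix = ["01000", "00100", "11001", "00110"]
print(is_tree_family(column_clusters(slide_matrix)))          # True

# Two columns that overlap without being nested.
print(is_tree_family(column_clusters(["10", "11", "01"])))    # False
```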

14 Haplotyping by the pp
A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes: rows si are haplotypes, columns ci are SNPs
SNP site i: the 0-1 switch in position i happens only once in the tree!
Figure: tree over the haplotypes 00000, 01000, 11000 and 01001

15 Haplotyping and the pp: observations
The root of T may not be the haplotype 000000
0-1 switch or 1-0 switch (directed case)
Figure: example trees, one using 0-1 switches and one using 1-0 switches

16 HI problem in the pp model
Input data: a 0-1-* matrix B (n × m) of genotypes G
Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t.
(1) each g ∈ G is solved by a pair of rows in B'
(2) B' has a pp (tree family)
DECISION problem
Example genotype rows: 01*1*001*, 001*11*11, 0000*1*1*

17 An example
Genotypes: a = * *, b = 0 *, c = 1 0
Haplotypes: a = 1 0, a' = 0 1; b = 0 1, b' = 0 0; c = 1 0, c' = 1 0
Tree leaves: a, c, c', b', a', b
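For instances as small as this example, the decision problem can be explored by exhaustive search. The sketch below is not one of the algorithms cited in the slides: it is an exponential-time brute force, and it assumes the directed case in which the root is the all-zero haplotype, so that a pp exists iff the column sets form a tree-family. All function names are hypothetical.

```python
from itertools import combinations, product


def resolutions(g):
    """Yield each unordered pair of haplotypes (h, k) that solves genotype g."""
    stars = [i for i, c in enumerate(g) if c == "*"]
    if not stars:
        yield g, g
        return
    # Fix h to 0 at the first heterozygous site so that each pair appears once.
    for bits in product("01", repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, ("0",) + bits):
            h[pos] = b
            k[pos] = "1" if b == "0" else "0"
        yield "".join(h), "".join(k)


def has_directed_pp(rows):
    """Tree-family test on the columns (assumes an all-zero root)."""
    m = len(rows[0])
    cols = [frozenset(i for i, r in enumerate(rows) if r[j] == "1") for j in range(m)]
    return all(not (x & y) or x <= y or y <= x for x, y in combinations(cols, 2))


def brute_force_pph(genotypes):
    """Return a 2n-row haplotype matrix with a directed pp, or None if there is none."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        rows = [h for pair in choice for h in pair]
        if has_directed_pp(rows):
            return rows
    return None


# Genotypes a, b, c of the example slide.
print(brute_force_pph(["**", "0*", "10"]))
# ['01', '10', '00', '01', '10', '10']
```

The returned rows are the same three haplotype pairs as in the example (a: 10 and 01, b: 01 and 00, c: 10 and 10), listed pair by pair.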

18 The pph problem: solutions
An undirected algorithm: Gusfield, Recomb 2002
An O(nm^2) algorithm: Karp et al., Recomb 2003
A linear time O(nm) algorithm?? Optimal algorithm
A related problem: the incomplete directed pp (IDP), inferring a pp from a 0-1-* matrix
O(nm + k log^2(n + m)) algorithm: I. Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004

19 IDP problem
Instance: a 0-1-? matrix A
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists"
Example (columns 1 to 5): the instance with rows 1?001, ??010, ?01?? is completed step by step to 10001, 11010, 10101
OPEN PROBLEM: find an optimal algorithm??

20 Decision algorithms for incomplete pp
Based on a characterization of the 0-1 matrices A that have a pp:
tree family
forbidden submatrix, which gives a "no" certificate: two columns X, Y and three rows reading 1 0, 1 1 and 0 1
bipartite graph G(A) = (S, C, E) with E = {(si, cj): bij = 1}; forbidden subgraph on columns c, c' and rows s1, s2, s3
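A "no" certificate of this kind can be searched for directly: look for two columns and three rows on which the matrix reads 1 0, 1 1 and 0 1. In the bipartite graph G(A) the same five vertices form a path that visits a row, a column, a row, a column and a row. The following sketch assumes rows are strings over "01"; the name `forbidden_submatrix` and the shape of the returned certificate are choices made for this example.

```python
from itertools import combinations


def forbidden_submatrix(matrix):
    """Return ((c, d), (r10, r11, r01)) for a forbidden submatrix, or None.

    c, d are column indices; r10, r11, r01 are row indices whose entries
    in columns (c, d) read 1 0, 1 1 and 0 1 respectively.
    """
    m = len(matrix[0])
    for c, d in combinations(range(m), 2):
        first_row = {}
        for i, row in enumerate(matrix):
            first_row.setdefault((row[c], row[d]), i)  # remember the first row per pattern
        patterns = [("1", "0"), ("1", "1"), ("0", "1")]
        if all(p in first_row for p in patterns):
            return (c, d), tuple(first_row[p] for p in patterns)
    return None


print(forbidden_submatrix(["10", "11", "01"]))                     # ((0, 1), (0, 1, 2))
print(forbidden_submatrix(["01000", "00100", "11001", "00110"]))   # None
```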

21 Test: does a 0-1 matrix A have a pp? O(nm) algorithm (Gusfield 1991)
Steps:
1. Given A, order the columns {c1, …, cm} as (decreasing) binary numbers, obtaining A'
2. Let L(i,j) = max{l < j: A'[i,l] = 1}
3. Let index(j) = max{L(i,j): i}
4. Then apply the theorem
TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1
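The four steps can be sketched as follows. Some details are assumptions of this sketch rather than statements from the slide: columns are read top to bottom as binary numbers, duplicate columns are merged, column indices in A' start at 1, L(i,j) is 0 when row i has no earlier 1, and L(i,j) and index(j) are only computed over cells with A'[i,j] = 1. The built-in sort is used for brevity, so this version does not reach the O(nm) bound, which would need a radix sort.

```python
def gusfield_has_pp(matrix):
    """Decide whether a 0-1 matrix (list of '01' strings, one per row) has a pp."""
    n, m = len(matrix), len(matrix[0])

    # Step 1: columns as binary strings, duplicates merged, decreasing order.
    # Equal-length binary strings compare like the numbers they encode.
    columns = sorted({"".join(row[j] for row in matrix) for j in range(m)},
                     reverse=True)

    # Steps 2 and 3: L(i, j) for every cell holding a 1, and index(j) per column.
    L = {}
    index = [0] * (len(columns) + 1)
    for i in range(n):
        last = 0  # rightmost earlier column of A' with a 1 in row i (0 if none)
        for j, col in enumerate(columns, start=1):
            if col[i] == "1":
                L[(i, j)] = last
                index[j] = max(index[j], last)
                last = j

    # Step 4: the theorem.
    return all(L[(i, j)] == index[j] for (i, j) in L)


print(gusfield_has_pp(["01000", "00100", "11001", "00110"]))  # True
print(gusfield_has_pp(["10", "11", "01"]))                    # False
```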

22 Idea:

23 The IDP algorithm
Figure: graph on the columns c, c' and the rows s1, s2, s3

24 Other HI problems via the pp model
Incomplete 0-1-*-? matrix because of missing data:
haplotype pp (Ihpp): haplotype rows
genotype pp (Igpp): genotype rows
Algorithms:
Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise
Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, Icalp 2004); NP-complete otherwise

25 HI problem and other models
Haplotype inference in pedigree data under the recombination model
Figure: maternal and paternal haplotypes and a child haplotype obtained by recombination

26 Pedigree graph
Single mating pedigree tree
Mating loop
Nuclear family
Pedigree graph
Figure labels: father, mother, child

27 Haplotype inference in pedigree
Figure: a pedigree whose members carry unphased genotypes (00, 01, 10, 11) and the corresponding phased paternal|maternal haplotypes (0|0, 0|1, 1|0, 1|1)

28 Problems
MPT-MRHI (pedigree tree multi-mating minimum recombination HI)
SPT-MRHI (pedigree tree single-mating minimum recombination HI)
OPEN
NP-complete even if the graph is acyclic, but with an unbounded number of children…

29 Conclusions

30 References

