Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo, Università di Milano-Bicocca. June 2, 2015.


1 Combinatorial methods in Bioinformatics: the haplotyping problem. Paola Bonizzoni, DISCo, Università di Milano-Bicocca

2 Content
  Motivation: biological terms
  Combinatorial methods in haplotyping
  Haplotyping via perfect phylogeny: the PPH problem
  Inference of incomplete perfect phylogeny: algorithms
  Incomplete PPH and missing data
  Other models: open problems

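3 Biological terms
  Diploid organism: two haplotypes (one maternal, one paternal) that together form the genotype.
  [Figure: maternal haplotype A A A and paternal haplotype G C A over sites i, i+1, i+2; a site is homozygous when the two alleles are equal and heterozygous when they differ]
  Biallelic site i: Value(i) ⊆ {A,C,G,T} with |Value(i)| ≤ 2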

4 Motivations
  Human genetic variations are related to diseases (cancers, diabetes, osteoporosis).
  The most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes.
  The Human Genome Project produces genotype sequences of humans.
  Computational methods to derive haplotypes from genotype data are needed.
  Ongoing international HapMap project: find haplotype differences on a large scale (HapMap population data).
  Combinatorial methods: graphs, set-cover problems, optimization problems.

5 Haplotyping: the formal model
  Haplotype: m-vector h = <h(1), …, h(m)> over {0,1}
  Genotype: m-sequence g = <g(1), …, g(m)> over {0,1,*}
  Def. Haplotypes <h, k> solve genotype g iff:
    g(i) = * implies h(i) ≠ k(i)
    h(i) = k(i) = g(i) otherwise
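
The definition above translates almost directly into a check. The following is a minimal sketch, assuming haplotypes and genotypes are encoded as Python strings over "01" and "01*"; the function name `solves` and the example vectors are illustrative, not taken from the slides.

```python
def solves(h, k, g):
    """Return True if the haplotype pair (h, k) solves genotype g.

    h and k are strings over "01"; g is a string over "01*".
    """
    if not (len(h) == len(k) == len(g)):
        return False
    for hi, ki, gi in zip(h, k, g):
        if gi == "*":
            # Heterozygous site: the two haplotypes must differ.
            if hi == ki:
                return False
        elif not (hi == ki == gi):
            # Homozygous site: both haplotypes must carry the genotype value.
            return False
    return True


if __name__ == "__main__":
    print(solves("01011", "01110", "01*1*"))
    print(solves("01011", "01011", "01*1*"))
```

The check is symmetric in h and k, so the order in which the two haplotypes are passed does not matter.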

6 Examples
  [Figure: a genotype g solved by haplotypes h and k; Clark's inference rule applied to genotypes g1, g2, g3, deriving haplotypes h1, h2, h3]

7 Haplotype inference: the general problem
  Problem HI
  Instance: a set G = {g1, …, gm} of genotypes and a set H = {h1, …, hn} of haplotypes.
  Solution: a set H' of haplotypes that solves each genotype g in G s.t. H ⊆ H'.
  H' derives from an inference RULE.

8 Types of inference rules
  Clark's rule: haplotypes solve g by an iterative rule.
  Gusfield coalescent model: haplotypes are related to genotypes by a tree model.
  Pedigree data: haplotypes are related to genotypes by a directed graph.
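
One step of Clark's rule can be sketched as follows: given an already known haplotype h that is compatible with a genotype g, the partner haplotype k is forced. This is a minimal sketch under the string encoding "01" / "01*"; the name `clark_infer` and the sample vectors are assumptions made for illustration, and the full iterative procedure (choosing which genotype to resolve next) is not shown.

```python
def clark_infer(h, g):
    """Given a known haplotype h, return the haplotype k such that (h, k)
    solves genotype g, or None if h is not compatible with g."""
    if len(h) != len(g):
        return None
    k = []
    for hi, gi in zip(h, g):
        if gi == "*":
            # Heterozygous site: k takes the allele that h does not have.
            k.append("1" if hi == "0" else "0")
        elif hi == gi:
            # Homozygous site: k must equal the genotype value.
            k.append(gi)
        else:
            # h disagrees with g on a homozygous site.
            return None
    return "".join(k)


if __name__ == "__main__":
    print(clark_infer("01011", "01*1*"))
    print(clark_infer("11011", "01*1*"))
```

In the iterative rule, each newly derived k would be added to the pool of known haplotypes and tried against the remaining unresolved genotypes.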

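9 Mendelian law and Recombination
  [Figure: Mendelian law: a father with haplotypes A, B and a mother with haplotypes C, D; the possible children C1 to C4 are AC, AD, BC, BD. Recombination: parent haplotypes AC and BD give the child haplotypes AD and BC]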

10 Pedigree: pedigree, nuclear family, founder

11 Pedigree: pedigree, nuclear family, founder
  [Figure: pedigree with father, mother and children, each with ID number and genotypes; labels: founders, nuclear family, family trio, loop, mating node]

12 Haplotyping from genotypes: the problem and methods
  Problem. Input: genotype data (possibly with missing values). Output: haplotypes.
  Input data: data with pedigree (dependent); data without pedigree information (independent).
  Statistical methods: find the most likely haplotypes based on genotype data. Advantage: solid theoretical basis. Disadvantage: computationally intensive.
  Rule-based methods: define rules based on some plausible assumptions and find the haplotypes consistent with these rules. Advantage: usually simple and thus very fast. Disadvantage: no numerical assessment of the reliability of the results.

13 HI by the perfect phylogeny model
  IDEA: genotypes are the mating of haplotypes in a tree. Given G, find H and T that explain G!
  G: g1 = 0,1,*,*,1 and g2 = *,0,0,0,1
  H: 0,1,1,0,1 and 0,1,0,1,1 (for g1); 1,0,0,0,1 and 0,0,0,0,1 (for g2)
  [Figure: tree T rooted at 00000 with the haplotypes of H at its leaves]

14 Perfect Phylogeny models
  Input data: 0-1 matrix A (characters, species). Output data: phylogeny for A.
        c1 c2 c3 c4 c5
    s1   1  1  0  0  0
    s2   0  0  1  0  0
    s3   1  1  0  0  1
    s4   0  0  1  1  0
  [Figure: tree with root R; the labels c1, c2 lead to leaf s1 and, through c5, to leaf s3; the label c3 leads to leaf s2 and, through c4, to leaf s4. The path to s4 is c3 c4]

15 Perfect phylogeny
  Def. A pp T for a 0-1 matrix A:
    each row s_i labels exactly one leaf of T
    each column c_j labels exactly one edge of T
    each internal edge is labelled by at least one column c_j
    row s_i gives the 0,1 path from the root to s_i
  [Figure: the tree of the previous slide; the path c3 c4 to leaf s4 corresponds to row 0011...]

16 pp model: another view
  L(x), the cluster of x: the set of leaves of the subtree T_x rooted at node x.
  A pp is associated to a tree-family (S, C) with S = {s1, …, sn} and C = {S' ⊆ S: S' is a cluster} s.t. for all X, Y in C, if X ∩ Y ≠ ∅ then X ⊆ Y or Y ⊆ X.

17 pp: another view
  A tree-family (S, C) is represented by a 0-1 matrix B: column c_i corresponds to a set S' in C, with s_j ∈ S' iff b_ji = 1 (for each set in C there is at least one column).
    0 1 0 0 0
    0 0 1 0 0
    1 1 0 0 1
    0 0 1 1 0
  Lemma. A 0-1 matrix has a pp iff it represents a tree-family.
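
The lemma suggests a direct, if quadratic, test: read each column as the set of rows in which it has a 1 and check the tree-family condition pairwise. This is a minimal sketch, assuming the matrix is given as a list of 0/1 rows; the names `column_clusters` and `is_tree_family` are illustrative.

```python
def column_clusters(matrix):
    """Return, for each column, the set of row indices that hold a 1."""
    n_cols = len(matrix[0]) if matrix else 0
    return [
        frozenset(r for r, row in enumerate(matrix) if row[c] == 1)
        for c in range(n_cols)
    ]


def is_tree_family(clusters):
    """Check that any two intersecting clusters are nested."""
    for i, x in enumerate(clusters):
        for y in clusters[i + 1:]:
            if x & y and not (x <= y or y <= x):
                return False
    return True


if __name__ == "__main__":
    # Matrix shown on the slide above (rows s1..s4, columns c1..c5).
    b = [
        [0, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 1, 0],
    ]
    print(is_tree_family(column_clusters(b)))
```

The pairwise comparison costs O(m^2) set operations, so it is meant as a readable restatement of the lemma rather than an efficient algorithm.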

18 Haplotyping by the pp
  A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes: rows s_i are haplotypes, columns c_i are SNPs.
  The 0-1 switch in position i (SNP site) happens only once in the tree!!
  [Figure: tree over the haplotypes 00000, 01000, 11000, 01001, with one 0-1 switch per edge]

19 Haplotyping and the pp: observations
  The root of T may not be the haplotype 000000.
  0-1 switch or 1-0 switch (directed case).
  [Figure: example trees over 5-site haplotypes, one using 0-1 switches and one using 1-0 switches]

20 HI problem in the pp model
  Input data: a 0-1-* matrix B (n × m) of genotypes G.
  Output data: a 0-1 matrix B' (2n × m) of haplotypes s.t.
    (1) each g ∈ G is solved by a pair of rows in B'
    (2) B' has a pp (tree family)
  DECISION problem.
  [Figure: genotype rows over {0,1,*} to be resolved into haplotype rows]

21 An example
  Genotypes: a = * *, b = 0 *, c = 1 0
  Haplotypes: a = 1 0, a' = 0 1; b = 0 1, b' = 0 0; c = 1 0, c' = 1 0
  [Figure: tree with leaves a, c, c', b', a', b]
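
For an instance as small as the example above, the decision problem can be explored by exhaustive search. The sketch below is an exponential brute force written only for illustration, not one of the published PPH algorithms. It assumes the directed case with an all-zero root and uses the standard test that no pair of columns may show all three row patterns 11, 10 and 01. All function names are hypothetical.

```python
from itertools import combinations, product


def resolutions(g):
    """Yield each unordered haplotype pair (h, k) that solves genotype g."""
    stars = [i for i, ch in enumerate(g) if ch == "*"]
    if not stars:
        yield g, g
        return
    # Fix h to 0 at the first '*' so that each unordered pair appears once.
    for bits in product("01", repeat=len(stars) - 1):
        h, k = list(g), list(g)
        for pos, b in zip(stars, ("0",) + bits):
            h[pos] = b
            k[pos] = "1" if b == "0" else "0"
        yield "".join(h), "".join(k)


def has_directed_pp(rows):
    """True if no column pair contains all of the patterns 11, 10 and 01."""
    forbidden = {("1", "1"), ("1", "0"), ("0", "1")}
    n_cols = len(rows[0]) if rows else 0
    for c, d in combinations(range(n_cols), 2):
        if forbidden <= {(r[c], r[d]) for r in rows}:
            return False
    return True


def solve_pph_bruteforce(genotypes):
    """Return one list of solving pairs whose rows admit a pp, or None."""
    for choice in product(*(list(resolutions(g)) for g in genotypes)):
        rows = [hap for pair in choice for hap in pair]
        if has_directed_pp(rows):
            return list(choice)
    return None


if __name__ == "__main__":
    # Genotypes a, b, c from the example slide.
    print(solve_pph_bruteforce(["**", "0*", "10"]))
```

On the three genotypes of the example, the search rejects the resolution of a into 00 and 11 and keeps the pairs {01, 10}, {00, 01} and {10, 10}, which are the haplotypes listed on the slide.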

22 The pph problem: solutions
  An undirected algorithm: Gusfield, Recomb 2002.
  An O(nm^2) algorithm: Karp et al., Recomb 2003.
  A linear time O(nm) algorithm?? Optimal algorithm.
  A related problem: the incomplete directed pp (IDP), inferring a pp from a 0-1-* matrix. O(nm + k log^2(n + m)) algorithm: I. Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004.

23 IDP problem
  Instance: a 0-1-? matrix A.
  Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say "no pp exists".
  Example instance (rows S1 to S3, columns C1 to C5):
    1 ? 0 0 1
    ? ? 0 1 0
    ? 0 1 ? ?
  A completion reached at the end of the example:
    1 0 0 0 1
    1 1 0 1 0
    1 0 1 0 1
  [Figure: intermediate matrices and the resulting tree with edges C1 to C5 and leaves S1, S2, S3]
  OPEN PROBLEM: find an optimal algorithm??
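
To make the problem statement concrete, here is a brute-force sketch that tries every assignment of the ? entries and tests each completed matrix. It is exponential in the number of ? entries and is not the algorithm of Pe'er et al.; it assumes the directed case with an all-zero root, and the 3 × 5 instance in the usage part is a reading of the example matrix on the slide. Function names are hypothetical.

```python
from itertools import combinations, product


def has_directed_pp(rows):
    """True if no column pair contains all of the patterns 11, 10 and 01."""
    forbidden = {("1", "1"), ("1", "0"), ("0", "1")}
    n_cols = len(rows[0]) if rows else 0
    for c, d in combinations(range(n_cols), 2):
        if forbidden <= {(r[c], r[d]) for r in rows}:
            return False
    return True


def complete_idp(rows):
    """Return a completion of the 0-1-? rows that has a pp, or None."""
    holes = [
        (r, c)
        for r, row in enumerate(rows)
        for c, ch in enumerate(row)
        if ch == "?"
    ]
    for bits in product("01", repeat=len(holes)):
        filled = [list(row) for row in rows]
        for (r, c), b in zip(holes, bits):
            filled[r][c] = b
        candidate = ["".join(row) for row in filled]
        if has_directed_pp(candidate):
            return candidate
    return None


if __name__ == "__main__":
    instance = ["1?001", "??010", "?01??"]
    print(complete_idp(instance))
    # The completion shown on the slide is also valid:
    print(has_directed_pp(["10001", "11010", "10101"]))
```

A valid completion is generally not unique, so the matrix returned by the search may differ from the one drawn on the slide while still admitting a perfect phylogeny.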

24 Decision algorithms for incomplete pp
  Based on a characterization of the 0-1 matrices A that have a pp:
    tree family
    forbidden submatrix (gives a "no" certificate)
  Bipartite graph G(A) = (S, C, E) with E = {(s_i, c_j): b_ij = 1}; forbidden subgraph.
  [Figure: forbidden subgraph on columns c, c' and rows s1, s2, s3, corresponding to the submatrix with rows 10, 11, 01]
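
A "no" certificate of the kind mentioned above can be produced by searching for the forbidden submatrix explicitly. The sketch below assumes the directed case (all-zero root) and a matrix given as a list of 0/1 rows; it returns the two columns and three rows that witness the patterns 10, 11 and 01. The function name and the return format are illustrative choices.

```python
from itertools import combinations


def find_forbidden_submatrix(matrix):
    """Return (c, d, r10, r11, r01) if columns c, d contain rows showing
    the patterns 10, 11 and 01; return None if no such pair exists."""
    n_cols = len(matrix[0]) if matrix else 0
    wanted = ((1, 0), (1, 1), (0, 1))
    for c, d in combinations(range(n_cols), 2):
        witness = {}
        for r, row in enumerate(matrix):
            pair = (row[c], row[d])
            if pair in wanted and pair not in witness:
                witness[pair] = r
        if len(witness) == 3:
            return (c, d, witness[(1, 0)], witness[(1, 1)], witness[(0, 1)])
    return None


if __name__ == "__main__":
    print(find_forbidden_submatrix([[1, 0], [1, 1], [0, 1]]))
    print(find_forbidden_submatrix([[1, 1, 0], [0, 0, 1], [1, 1, 0]]))
```

In the bipartite graph G(A), the returned rows and columns are the vertices of the forbidden subgraph: the 11 row is adjacent to both columns, and each of the other two rows is adjacent to exactly one of them.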

25 Test: does a 0-1 matrix A have a pp?
  O(nm) algorithm (Gusfield 1991). Steps:
    1. Given A, order the columns {c1, …, cm} as (decreasing) binary numbers, obtaining A'.
    2. Let L(i,j) = k, with k = max{l < j: A'[i,l] = 1}.
    3. Let index(j) = max{L(i,j): i}.
    4. Then apply the theorem.
  TH. A' has a pp iff L(i,j) = index(j) for each (i,j) s.t. A'[i,j] = 1.
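
The four steps can be sketched as follows. A few details are assumptions of this sketch rather than statements from the slide: duplicate columns are merged before sorting, L(i,j) is taken to be 0 when row i has no earlier 1, index(j) is the maximum over the rows i with A'[i,j] = 1, and a comparison sort is used for brevity, so the code does not reach the O(nm) bound (which needs a radix sort).

```python
def has_perfect_phylogeny(matrix):
    """Test a 0-1 matrix (list of rows) for a directed perfect phylogeny."""
    if not matrix or not matrix[0]:
        return True
    n_rows, n_cols = len(matrix), len(matrix[0])

    # Step 1: read each column top to bottom as a binary number, merge
    # duplicates, and sort in decreasing order to obtain A'.
    columns = {tuple(row[j] for row in matrix) for j in range(n_cols)}
    ordered = sorted(columns, reverse=True)

    # Steps 2 and 3: L(i, j) is the position of the closest 1 to the left
    # of column j in row i (0 if none); index(j) is the maximum L(i, j).
    L = {}
    index = [0] * (len(ordered) + 1)
    for i in range(n_rows):
        last = 0
        for j, col in enumerate(ordered, start=1):
            if col[i] == 1:
                L[(i, j)] = last
                index[j] = max(index[j], last)
                last = j

    # Step 4: the theorem requires L(i, j) == index(j) for every 1 entry.
    return all(value == index[j] for (_, j), value in L.items())


if __name__ == "__main__":
    a = [
        [1, 1, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 0, 1],
        [0, 0, 1, 1, 0],
    ]
    print(has_perfect_phylogeny(a))
    print(has_perfect_phylogeny([[1, 0], [1, 1], [0, 1]]))
```

Sorting by decreasing binary value places every column before the columns whose 1-sets it contains, which is why comparing the nearest 1 to the left is enough.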

26 Idea:

27 The IDP algorithm
  [Figure: bipartite graph on columns c, c' and rows s1, s2, s3]

28 Other HI problems via the pp model
  Incomplete 0-1-*-? matrix because of missing data:
    haplotype pp (Ihpp): haplotype rows
    genotype pp (Igpp): genotype rows
  Algorithms:
    Ihpp = IDP given a row as a root (polynomial time); NP-complete otherwise.
    Igpp has a polynomial solution under the rich data hypothesis (Karp et al., Recomb 2004, ICALP 2004); NP-complete otherwise.

29 HI problem and other models
  Haplotype inference in pedigree data under the recombination model.
  [Figure: maternal and paternal haplotype pairs and a child haplotype obtained by recombination]
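
Under the recombination model, a child haplotype inherited from one parent is a mosaic of that parent's two haplotypes, and each switch between them counts as one recombination. The sketch below computes the minimum number of switches with a two-state dynamic program. It illustrates the model only and is not an algorithm from the slides; the function name and the input vectors are hypothetical.

```python
def min_recombinations(child, hap_a, hap_b):
    """Minimum number of switches between hap_a and hap_b needed to copy
    the child haplotype site by site; None if it cannot be explained."""
    inf = float("inf")
    cost_a = cost_b = 0  # cheapest way to be copying from a / from b so far
    for c, a, b in zip(child, hap_a, hap_b):
        new_a = min(cost_a, cost_b + 1) if c == a else inf
        new_b = min(cost_b, cost_a + 1) if c == b else inf
        cost_a, cost_b = new_a, new_b
    best = min(cost_a, cost_b)
    return None if best == inf else best


if __name__ == "__main__":
    print(min_recombinations("0011", "0000", "1111"))
    print(min_recombinations("0000", "0000", "1111"))
```

Minimum-recombination haplotype inference on a pedigree asks for haplotypes at every member that minimize the total of such counts over all parent-child transmissions, which is a much harder problem than this single-transmission count.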

30 Pedigree graph
  [Figure: single mating, pedigree tree, mating loop, nuclear family, pedigree graph; nodes labelled father, mother, child]

31 Haplotype inference in pedigree
  [Figure: pedigree whose members carry unphased genotypes (00, 01, 10, 11) and the inferred phased haplotypes written as paternal|maternal (0|0, 0|1, 1|0, 1|1)]

32 Problems
  MPT-MRHI (pedigree tree multi-mating minimum recombination HI)
  SPT-MRHI (pedigree tree single-mating minimum recombination HI)
  OPEN
  NP-complete even if the graph is acyclic, but with an unbounded number of children…

33 Conclusions

34 References

