Giuseppe Lancia University of Udine The phasing of heterozygous traits: Algorithms and Complexity.


- The genomic age has allowed us to look at ourselves in a detailed, comparative way
- All humans are >99% identical at the genome level
- Small changes in a genome can make a big difference in how we look and who we are

What makes us different from each other? The answer is POLYMORPHISMS

This is true for humans as well as for other species

Polymorphisms are features existing in different “flavours” that make us all look (and be) different. Examples are eye color, blood type, hair, etc. In fact, polymorphisms in the way we look (phenotypes) are determined by polymorphisms in our genome.

For a given polymorphism, say eye color, the possible forms are called alleles. We all inherit two alleles (paternal and maternal). If they are identical, we are HOMOZYGOUS; if they are different, HETEROZYGOUS.

[Figure: mother/father/child inheritance diagrams showing homozygous and heterozygous children with dominant and recessive alleles; in the final panels the alleles are replaced by question marks (??).]

Single Nucleotide Polymorphisms

At the DNA level, a polymorphism is a sequence of nucleotides that varies in a population. The shortest possible sequence has only 1 nucleotide, hence Single Nucleotide Polymorphism (SNP).
atcgg a ttagttagggcacaggacg g ac
atcgg a ttagttagggcacaggacg t ac
atcgg c ttagttagggcacaggacg t ac
atcgg a ttagttagggcacaggacg g ac
atcgg c ttagttagggcacaggacg t ac
atcgg c ttagttagggcacaggacg g ac
atcgg a ttagttagggcacaggacg t ac
atcgg a ttagttagggcacaggacg g ac
atcgg c ttagttagggcacaggacg g ac
atcgg a ttagttagggcacaggacg g ac
atcgg c ttagttagggcacaggacg g ac
atcgg a ttagttagggcacaggacg g ac

- SNPs are the predominant form of human variation
- Used for drug design, disease studies, forensics, evolutionary studies...
- On average one every 1,000 bases

ag at ct ag ct cg at ag cg ag cg ag
HAPLOTYPE: chromosome content at SNP sites
GENOTYPE: “union” of 2 haplotypes
{c}{g,t} {a,c}{g,t} {a}{g} {a}{g,t} {a}{t} {a,c}{g}

CHANGE OF SYMBOLS: each SNP takes only two values in a population (a biological fact). Call them 0 and 1. Also, call 2 the fact that a site is heterozygous.
HAPLOTYPE: string over 0, 1
GENOTYPE: string over 0, 1, 2, where 0={0}, 1={1}, 2={0,1}

ALGEBRA OF HAPLOTYPES:
Homozygous sites: 0 + 0 = 0, 1 + 1 = 1
Heterozygous (ambiguous) sites: 0 + 1 = 1 + 0 = 2
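The sum of two haplotypes can be sketched in a few lines of Python. This is only an illustration of the 0/1/2 encoding defined above; the function name `genotype` is an assumption, not something from the slides.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two haplotypes (strings over 0/1) into a genotype (string over 0/1/2)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Equal alleles keep their value (homozygous site); different alleles give 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype("11011", "11101"))  # 11221
```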

Phasing the alleles: for k heterozygous (ambiguous) sites, there are 2^(k-1) possible phasings
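The count of phasings can be checked by enumeration. The sketch below is a minimal illustration, assuming genotypes are strings over 0/1/2; the helper name `phasings` is hypothetical. It fixes the first heterozygous site so that each unordered pair of haplotypes is produced once.

```python
from itertools import product


def phasings(g: str):
    """Yield every unordered haplotype pair (h1, h2) whose sum is the genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        yield g, g  # no ambiguous site: the only resolution is g itself, twice
        return
    # Fix the first heterozygous site to 0 in h1 so each unordered pair appears once.
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        yield "".join(h1), "".join(h2)


print(list(phasings("11221")))  # [('11001', '11111'), ('11011', '11101')]
```

With k = 2 ambiguous sites the genotype 11221 has 2^(2-1) = 2 phasings, as listed.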

THE PHASING (or HAPLOTYPING) PROBLEM: Given the genotypes of k individuals, determine the phasings of all heterozygous sites. This yields a set H of (at most) 2k haplotypes. H is a resolution of G. It is too expensive to determine haplotypes directly; it is much cheaper to determine genotypes and then infer haplotypes in silico.

The input is GENOTYPE data
INPUT: G = { 11221, 22221, 11011, 21221, }
OUTPUT: H = { 11011, 11101, 00011, }
Each genotype is resolved by two haplotypes. We will define some objectives for H.

OBJECTIVES
- without objectives/constraints, the haplotyping problem would be (mathematically) trivial, e.g., always put 0 above and 1 below
- the objectives/constraints must be “driven by biology”

OBJECTIVES
1°) Clark’s inference rule
2°) (parsimony): minimize |H|
3°) Perfect Phylogeny
4°) Disease Association

1st Obj: Clark’s rule

Inference Rule for a compatible pair h, g: from a known haplotype h and a known (ambiguous) genotype g, derive the new haplotype h’ such that h + h’ = g. We write h + h’ = g.
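As a hedged sketch of the inference rule, the function below derives h’ from a compatible pair (h, g) and returns None when h is not compatible with g. The name `clark_rule` and the string encoding are assumptions made for illustration.

```python
def clark_rule(h: str, g: str):
    """Return h' with h + h' = g, or None if h is not compatible with g."""
    derived = []
    for a, c in zip(h, g):
        if c == "2":
            # Ambiguous site: h' must carry the allele that h does not.
            derived.append("1" if a == "0" else "0")
        elif a == c:
            # Homozygous site: both haplotypes agree with the genotype.
            derived.append(c)
        else:
            # h disagrees with g at a homozygous site, so the rule does not apply.
            return None
    return "".join(derived)


print(clark_rule("11011", "11221"))  # 11101
print(clark_rule("00011", "11221"))  # None
```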

1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3.    apply the rule to any such (h, g), obtaining h’
4.    set H = H + {h’} and G = G - {g}
5. end while
If, at the end, G is empty, SUCCESS; otherwise FAILURE.
Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible. The number of genotypes solved depends on the order of application. (In the slide example, one order of application ends in SUCCESS and another in FAILURE: can’t resolve 1122.)
OBJ: find the order of rule applications that leaves the fewest elements in G
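The loop and its order dependence can be sketched as follows. This is a greedy variant that always applies the first applicable pair, not necessarily the procedure shown on the slides; the function names and the toy instance (bootstrap haplotypes 0000 and 1000, genotypes 2200 and 1122) are assumptions chosen to reproduce a failure on 1122.

```python
def clark_rule(h, g):
    """Return h' with h + h' = g, or None if h is not compatible with g."""
    out = []
    for a, c in zip(h, g):
        if c == "2":
            out.append("1" if a == "0" else "0")
        elif a == c:
            out.append(c)
        else:
            return None
    return "".join(out)


def clark(bootstrap, genotypes):
    """Greedy run of Clark's algorithm; returns (H, unresolved genotypes)."""
    H, G = list(bootstrap), list(genotypes)
    progress = True
    while progress:
        progress = False
        for g in G:
            # Try the known haplotypes in list order; the order decides the outcome.
            derived = next((d for d in (clark_rule(h, g) for h in H) if d), None)
            if derived is not None:
                if derived not in H:
                    H.append(derived)
                G.remove(g)
                progress = True
                break  # restart the scan, since H and G have changed
    return H, G


print(clark(["0000", "1000"], ["2200", "1122"]))  # SUCCESS: nothing left in G
print(clark(["1000", "0000"], ["2200", "1122"]))  # FAILURE: 1122 is left in G
```

In the first run 0000 resolves 2200 into 1100, which then resolves 1122. In the second run 1000 resolves 2200 into 0100, and no known haplotype is compatible with 1122.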

The problem was studied by Gusfield (ISMB 2000, and Journal of Comp. Biol., 2001):
- the problem is APX-hard
- it corresponds to finding the largest forest in a graph with haplotypes as nodes and arcs for possible derivations
- solved via an ILP of exponential size (practical for small real instances)

2nd Obj: Max Parsimony

- Clark conjectured that the solution (when found) uses the minimum number of haplotypes
- this is clearly false
- a solution with few haplotypes is biologically relevant (as we all descend from a small set of ancestors)

2nd Objective (parsimony): minimize |H|

1. The problem is APX-hard. Reduction from VERTEX-COVER.

[Figure: a graph on nodes A, B, C, D, E with edges AB, BC, AE, DE, AD, and the genotype matrix built from it, with columns A, B, C, D, E, * and one row per edge (each containing two 2s) and one row per node; the last panel adds haplotypes A’, B’, E’.]
G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes
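To make the vertex-cover side of the reduction concrete, the sketch below brute-forces a smallest node cover of the example graph and reports the corresponding |V| + k. The edge list is read off the slide labels (AB, BC, AE, DE, AD); the function names are assumptions, and the genotype construction itself is not reproduced here.

```python
from itertools import combinations

V = ["A", "B", "C", "D", "E"]
E = [("A", "B"), ("B", "C"), ("A", "E"), ("D", "E"), ("A", "D")]


def is_cover(X, edges):
    """A node set X is a cover if every edge has at least one endpoint in X."""
    return all(u in X or v in X for u, v in edges)


def min_vertex_cover(nodes, edges):
    """Try node subsets by increasing size and return the first cover found."""
    for k in range(len(nodes) + 1):
        for X in combinations(nodes, k):
            if is_cover(set(X), edges):
                return set(X)


X = min_vertex_cover(V, E)
print(len(X), len(V) + len(X))  # 3 8
print(is_cover({"A", "B", "E"}, E))  # True
```

A smallest cover here has 3 nodes, so by the statement above the genotypes of this instance would be explained by |V| + k = 8 haplotypes.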

A basic ILP formulation: expand your input G in all possible ways

The resulting Integer Program (IP1):

Other ILP formulations are possible, e.g. POLY-SIZE ILP formulations
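The expansion step can be illustrated without an ILP solver. The sketch below expands every genotype into all of its haplotype pairs and then searches exhaustively for a smallest H that resolves all genotypes. It is a brute-force stand-in for the integer program, not the formulation from the slides, and is only practical for tiny inputs; the names `phasings` and `min_parsimony` are assumptions.

```python
from itertools import combinations, product


def phasings(g):
    """All unordered haplotype pairs (h1, h2) whose sum is the genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    pairs = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):
            h1[i], h2[i] = b, ("1" if b == "0" else "0")
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def min_parsimony(genotypes):
    """Smallest haplotype set resolving every genotype, by exhaustive search."""
    expansions = {g: phasings(g) for g in genotypes}
    candidates = sorted({h for ps in expansions.values() for p in ps for h in p})
    for size in range(1, len(candidates) + 1):
        for H in combinations(candidates, size):
            Hs = set(H)
            # Every genotype needs at least one expansion with both haplotypes in H.
            if all(any(a in Hs and b in Hs for a, b in ps)
                   for ps in expansions.values()):
                return Hs


H = min_parsimony(["11221", "22221", "11011", "21221"])
print(len(H))  # 4
```

On the four genotypes of the earlier input example, the search finds that four haplotypes are needed.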

3rd Obj: Perfect Phylogeny

- Parsimony does not take into account mutations/evolution of haplotypes
- parsimony is very reliable on “small” haplotype blocks
- when haplotypes are large (span several SNPs), we should consider evolutionary events and recombination
- the cleanest model for evolution is the perfect phylogeny

The 3rd objective is based on perfect phylogeny:
- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
[Figure: a tree whose edges are labeled “has 2 legs”, “has tail”, “flies”.]
But… a new species may come along so that no perfect phylogeny is possible…

Theorem: such a matrix has a perfect phylogeny (p.p.) iff there is no 4x2 minor containing all four rows 00, 01, 10, 11
[Matrix: rows Human, Mouse, Spider, Eagle, then Mickey Mouse added; columns two legs, tail, flies.]
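A test of this kind is short to write down. The sketch below assumes the condition is the usual four-gamete test on pairs of columns (no pair of columns shows all of 00, 01, 10, 11); the function name and the toy matrices are hypothetical and are not the species matrix from the slide.

```python
from itertools import combinations


def admits_perfect_phylogeny(rows):
    """Four-gamete test on a binary matrix given as equal-length 0/1 strings."""
    n_cols = len(rows[0])
    for i, j in combinations(range(n_cols), 2):
        # Collect the distinct value pairs that columns i and j take over all rows.
        seen = {(r[i], r[j]) for r in rows}
        if len(seen) == 4:
            return False  # 00, 01, 10 and 11 all occur: forbidden 4x2 minor
    return True


print(admits_perfect_phylogeny(["000", "100", "110", "111"]))  # True
print(admits_perfect_phylogeny(["00", "01", "10", "11"]))      # False
```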

We can consider each SNP as a binary feature.
Objective: we want the solution to admit a perfect phylogeny (Rationale: we assume haplotypes have evolved independently along a tree).
[Example panels: one resolution is NOT a perfect phylogeny solution; another resolution is a perfect phylogeny.]

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial

Algorithms are of a combinatorial nature:
- there is a graph whose nodes are the SNPs (columns) and whose edges are of two types (forced and free)
- forced edges connect pairs of SNPs that must be phased in the same way (a 22 pair is phased either as 00/11 or as 01/10)
- a complex visit of the graph decides how to phase free SNPs

4th Obj: Disease Association

Some diseases may be due to a gene which has “faulty” configurations.
RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one faulty copy one is a carrier of the disease.
DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy.
Two individuals, one healthy and the other diseased, may have the same genotype. The explanation of the disease lies in a difference in their haplotypes.

INPUT: GD = { 11221, 21221, 02011 }, GH = { 11221, 02201, 00011 }
OUTPUT: H = { 11011, 01011, 00001, 11111, 11101, 00011, 01101 }
H contains H_D, s.t. each diseased individual has >=1 haplotype in H_D and each healthy individual has none
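The separation condition on H_D can be checked mechanically once each individual has been phased. The sketch below is illustrative only: the function name `separates` and the phased toy individuals are hypothetical and are not the slide's solution. It shows two individuals with the same genotype 11221 but different haplotypes, one diseased and one healthy.

```python
def separates(diseased, healthy, HD):
    """Check the condition on H_D for phased individuals given as (h1, h2) pairs."""
    # Each diseased individual must carry at least one haplotype of HD ...
    every_diseased_hit = all(a in HD or b in HD for a, b in diseased)
    # ... and no healthy individual may carry any haplotype of HD.
    no_healthy_hit = all(a not in HD and b not in HD for a, b in healthy)
    return every_diseased_hit and no_healthy_hit


# Hypothetical phasings: both individuals have genotype 11221.
diseased = [("11001", "11111")]
healthy = [("11011", "11101")]

print(separates(diseased, healthy, {"11111"}))  # True
print(separates(diseased, healthy, {"11101"}))  # False
```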

Theorem 1 is proved via a reduction from 3SAT.
Theorem 2 has a mathematical proof (coloring argument) with little relation to biology: there is an R (depending on the input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3.
This means the model must be refined!

Summary:
- in-silico haplotyping is needed for economic reasons
- several objectives, all biologically driven
- nice combinatorial problems (mostly from the binary nature of SNPs)
- these problems are technology-dependent and may become obsolete (hopefully after we have retired)

Thanks