

The Questions: Why study haplotypes? How can haplotypes be inferred? What are haplotype blocks? How can haplotype information be used to test associations with disease phenotypes? How shall we select a subset of informative SNPs for large-scale typing? How can haplotype information be visualized?

Methods for inferring haplotype blocks and informative SNP selection. Detecting haplotype blocks on Chromosomes 6, 21, 22.

Hypothesis – Haplotype Blocks? The genome consists largely of blocks of common SNPs with relatively little recombination shuffling within the blocks – Patil et al., Science, 2001; Jeffreys et al., Nature Genetics; Daly et al., Nature Genetics, 2001. Compare block detection methods: –How well can we detect haplotype blocks? –Are the detection methods consistent?

Block detection methods
Four gamete test, Hudson and Kaplan, Genetics, 1985, 111, – A segment of SNPs is a block if, between every pair of SNPs (with alleles a/A and b/B), at most 3 of the 4 possible gametes (ab, aB, Ab, AB) are observed.
P-value test – A segment of SNPs is a block if, for 95% of the pairs of SNPs, we can reject the hypothesis (with P-value 0.05 or 0.001) that they are in linkage equilibrium.
LD-based, Gabriel et al., Science, 2002, 296: – Next slide
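The four gamete test above can be sketched in a few lines of Python. This is a minimal illustration, assuming phased haplotypes given as equal-length strings (rows) over bi-allelic SNPs (columns); the function names and the toy data are invented for this sketch and do not come from the slides.

```python
from itertools import combinations

def passes_four_gamete_test(haplotypes, i, j):
    """True if SNP columns i and j show at most 3 distinct gametes."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) <= 3

def is_block(haplotypes, start, end):
    """True if every pair of SNPs in columns [start, end) passes the test."""
    return all(passes_four_gamete_test(haplotypes, i, j)
               for i, j in combinations(range(start, end), 2))

# Toy phased haplotypes (invented data): 4 haplotypes over 4 SNPs.
haps = ["AAGC", "AAGT", "GTGC", "GTCT"]

print(is_block(haps, 0, 3))  # first three SNPs: no pair shows all 4 gametes
print(is_block(haps, 0, 4))  # SNPs 1 and 4 show all 4 gametes
```

Under this reading, the first three SNPs form a block, while adding the fourth SNP breaks it because columns 1 and 4 exhibit all four gametes.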

Gabriel et al. method
For every pair of SNPs we calculate an upper and a lower confidence bound on D’ (call these D’u and D’l). We then split the pairs of SNPs into 3 classes:
–Class I: Two SNPs are in ‘strong LD’ if D’u > 0.98 and D’l > 0.7.
–Class II: Two SNPs show ‘strong evidence for recombination’ if D’u < 0.9.
–Class III: The remaining SNP pairs; these are ‘uninformative’.
A contiguous set of SNPs is a block if (Class II)/(Class I + Class II) < 5%.
Special rules determine whether 2, 3 or 4 SNPs are a block. Furthermore, there are distance requirements on the chromosome to determine if the SNPs are a block.
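The classification and the 5% criterion can be sketched as follows. This is only an illustration of the thresholds quoted on the slide: the confidence bounds on D’ are assumed to be computed elsewhere and passed in as (lower, upper) pairs, the special rules for 2 to 4 SNPs and the distance requirements are not modeled, and treating a segment with no informative pairs as "not a block" is an assumption of this sketch.

```python
def classify_pair(d_lower, d_upper):
    """Assign a SNP pair to one of the three classes from its D' bounds."""
    if d_upper > 0.98 and d_lower > 0.7:
        return "strong_ld"          # Class I
    if d_upper < 0.9:
        return "recombination"      # Class II
    return "uninformative"          # Class III

def is_gabriel_block(bounds):
    """bounds: (D'l, D'u) for every SNP pair in a contiguous segment."""
    classes = [classify_pair(lo, hi) for lo, hi in bounds]
    strong = classes.count("strong_ld")
    recomb = classes.count("recombination")
    informative = strong + recomb
    if informative == 0:
        return False  # assumption: no informative pairs means no block call
    return recomb / informative < 0.05

# Hypothetical bounds: 20 pairs in strong LD, 1 pair with evidence of recombination.
example = [(0.8, 0.99)] * 20 + [(0.2, 0.85)]
print(is_gabriel_block(example))
```

With 1 recombinant pair out of 21 informative pairs the ratio is below 5%, so the segment is called a block under this simplified rule.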

Block View

Block comparison

Conclusions: Clear evidence of “blocky” structure in chromosomes. Different block detection methods are highly concordant. However, boundaries defined by these methods are not sharp, and we believe there is no single “true” block partition.

Block free SNP selection

What does it mean to tag SNPs? SNP = Single Nucleotide Polymorphism –Caused by a mutation at a single position in the human genome, passed along through heredity –Characterizes much of the genetic difference between humans –Most SNPs are bi-allelic –Estimated several million common SNPs (minor allele frequency >10%) To tag = select a subset of SNPs to work with

Why do we tag SNPs? Disease Association Studies –Goal: Find genetic factors correlated with disease –Look for discrepancies in haplotype structure –Statistical Power: Determined by sample size –Cost: Determined by overall number of SNPs typed This means, to keep cost down, reduce the number of SNPs typed Choose a subset of SNPs, [tag SNPs] that can predict other SNPs in the region with small probability of error –Remove redundant information

What do we know? SNPs physically close to one another tend to be inherited together –This means that long stretches of the genome (sans mutational events) should be perfectly correlated if not for… Recombination breaks apart haplotypes and slowly erodes correlation between neighboring alleles –Tends to blur the boundaries of LD blocks Since SNPs are bi-allelic, each SNP defines a partition on the population sample. –If you are able to reconstruct this partition by using other SNPs, there would be no need to type this SNP –For any single SNP, this reconstruction is not difficult…

Complications: Finding the globally minimum set of tag SNPs is NP-hard. The predictions made will not be perfect –Correlation between neighboring tag SNPs is not as strong as correlation between neighboring (not necessarily tagged) SNPs. Haplotype information is usually not available for technical reasons –Need for phasing

Tagging SNPs can be partitioned into the following three steps: –Determining neighborhoods of LD: which SNPs can infer each other –Tagging quality assessment: Defining a quality measure that specifies how well a set of tag SNPs captures the variance observed –Optimization: Minimizing the number of tag SNPs

Optimal Haplotype Block-Free Selection of Tagging SNPs for Genome-Wide Association Studies Halldorsson et al (2004)

The Definition of Perfect Prediction of a SNP from a set of SNPs

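“Predict a SNP” [Figure: haplotypes Hap1 and Hap2 by SNP site (alleles shown: A G T A A C), with labels “Nothing to predict” and “Predicts SNP 3”. Panel “SNP 2 predicts SNP 3”: SNP 2 has alleles G, C, G, C and SNP 3 has alleles T, A, T, A.] Prediction algorithm: If SNP 2 has allele G, then SNP 3 has allele T. If SNP 2 has allele C, then SNP 3 has allele A.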

“Predict a SNP” (cont) [Figure: haplotypes Hap1 and Hap2 by SNP site (alleles shown: A G T A A C), with labels “Nothing to predict”, “Predicts SNP 3”, “Predicts SNP 4”, “Predicts each of SNPs 2 and 4”, and “Predicts each of SNPs 2 and 3”.]

A graphical notation [Figure: alleles A G T A A C] “The blue box predicts the green SNP”

Three SNPs Predicting Each Other: G T A C A C. Only one of the three needs to be typed; any one will do.

A Pair of SNPs Predicting Another SNP. Haplotypes over SNPs 1–4: G T A G, C T A T, G G T T. SNPs 1 and 3 together predict SNP 4. No single SNP (other than SNP 4) can predict SNP 4.
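The claim on this slide can be checked mechanically. The sketch below reads the twelve letters as three haplotypes GTAG, CTAT and GGTT over four SNPs, which is an assumption about the figure layout that happens to agree with the statements on the slide. The helper name `predicts` and the 0-based column indices are choices made for this illustration.

```python
def predicts(haplotypes, predictors, target):
    """True if the alleles at the predictor columns determine the target allele."""
    seen = {}
    for hap in haplotypes:
        key = tuple(hap[p] for p in predictors)
        # The first haplotype with this key fixes the expected target allele.
        if seen.setdefault(key, hap[target]) != hap[target]:
            return False
    return True

haps = ["GTAG", "CTAT", "GGTT"]  # rows = haplotypes, columns = SNPs 1..4

print(predicts(haps, [0, 2], 3))                   # SNPs 1 and 3 -> SNP 4
print([predicts(haps, [s], 3) for s in range(3)])  # each single SNP -> SNP 4
```

Here "prediction" is treated as a functional dependency: whenever two haplotypes agree on the predictor SNPs, they must also agree on the target.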

Finding Neighborhoods: The goal is to select SNPs in the sample that characterize regions of common recent ancestry, which will contain conserved haplotypes. Recent common ancestry means that there has been little time for recombination to break apart haplotypes. Constructing fixed-size neighborhoods in which to look for SNPs is not desirable because of the variability of recombination rates and historical LD across the genome. In fact, the size of informative neighborhoods is highly variable precisely because of variable recombination rates and SNP density. The authors avoid block-building by recursively creating neighborhoods with the help of an ‘informativeness’ measure.

A measure of tagging quality assessment. Assume all SNPs are bi-allelic. Notation: I(s,t) = informativeness of a SNP s with respect to a SNP t. Defining informativeness: I(s,t) = Prob[ D(s,i,j) | D(t,i,j) ], where D(s,i,j) is the event that haplotypes i and j have different alleles at s –i, j are two haplotypes drawn at random from the uniform distribution on the set of distinct haplotype pairs. –Note: I(s,t) = 1 implies complete predictability; I(s,t) = 0 when t is monomorphic in the population. I(s,t) is easily estimated through the bipartite clique that each SNP defines –We can write I(s,t) in terms of an edge set. The definition of I is easily extended to a set of SNPs S by taking the union of edge sets. Assumes the availability of haplotype phases. The new measure avoids some of the difficulties traditional LD measures have experienced when applied to tagging SNP selection –The concept of pairwise LD fails to reliably capture the higher-order dependencies implied by haplotype structure.
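A direct way to estimate I(s,t) from a phased sample is to count haplotype pairs. The sketch below is an illustration rather than the authors' implementation: it uses the five sample haplotypes that appear later in the deck, 0-based column indices, and the convention (taken from the note above) that I(s,t) = 0 when t is monomorphic.

```python
from itertools import combinations

def informativeness(haplotypes, s, t):
    """Estimate I(s,t): among pairs that differ at t, the fraction that also differ at s."""
    pairs_t = [(a, b) for a, b in combinations(haplotypes, 2) if a[t] != b[t]]
    if not pairs_t:
        return 0.0  # t is monomorphic in the sample
    both = sum(1 for a, b in pairs_t if a[s] != b[s])
    return both / len(pairs_t)

haps = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]

print(informativeness(haps, 2, 3))  # SNPs 3 and 4 split the sample identically
print(informativeness(haps, 0, 2))  # SNP 1 only partly separates what SNP 3 separates
```

Pairs of identical haplotypes never differ at t, so they drop out of the conditional probability regardless of how "distinct haplotype pairs" is read.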

Bounded-Width Algorithm: k Most Informative SNPs (k-MIS). Input: a set of n SNPs S. Output: a subset of SNPs S’ of size at most k such that I(S’,S) is maximal. In its most general form, k-MIS is NP-hard, by reduction of the set cover problem to MIS. The algorithm optimizes informativeness, although it is easily adapted to other measures. Define the distance between two SNPs as the number of SNPs in between them. k-MIS can be solved as long as the distance between adjacent tag SNPs is not too large.

Define –Assignment A_s[i] –S(A_s) –Recursion function I_w(s, l, S(A)) = score of the most informative subset of l SNPs chosen from SNPs 1 through s such that A_s describes the assignment for SNP s. Complexity: O(nk2^w) in time and O(k2^w) in space, assuming maximal window w.

Evaluation: The algorithm was evaluated by leave-one-out cross-validation –Accuracy accumulated over all haplotypes gives a global measure of the accuracy for the given data set. SNPs not typed were predicted by a majority vote among all haplotypes in the training set that were identical to the one being inferred on the typed SNPs –If no such haplotypes existed, the majority vote was taken among all training haplotypes that have the same allele call on all but one of the typed SNPs –etc. When compared to the block-based method of Zhang: –Presumably, the advantage is due to the cost imposed by artificially restricting the range of influence of the few SNPs chosen by block boundaries. ‘Informativeness’ was shown to be a “good” measure –aligned well with the leave-one-out cross-validation results –extremely close to the results of optimizing for haplotype r^2
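The leave-one-out procedure described above can be sketched as below. This is a hedged reading of the slide, not the authors' code: "identical, else all but one, etc." is implemented as "vote among the training haplotypes with the fewest mismatches on the typed SNPs", ties in the vote are broken by first occurrence (an assumption), and the function names and toy data are invented for the illustration.

```python
from collections import Counter

def impute(train, typed_alleles, typed, target):
    """Majority vote for the target allele among the closest training haplotypes."""
    mismatches = [sum(h[c] != a for c, a in zip(typed, typed_alleles)) for h in train]
    best = min(mismatches)  # 0 if an identical haplotype exists, else 1, else 2, ...
    votes = Counter(h[target] for h, m in zip(train, mismatches) if m == best)
    return votes.most_common(1)[0][0]

def leave_one_out_accuracy(haplotypes, typed):
    """Fraction of untyped alleles imputed correctly when each haplotype is held out."""
    untyped = [c for c in range(len(haplotypes[0])) if c not in typed]
    correct = total = 0
    for k, hap in enumerate(haplotypes):
        train = haplotypes[:k] + haplotypes[k + 1:]
        alleles = [hap[c] for c in typed]
        for t in untyped:
            correct += impute(train, alleles, typed, t) == hap[t]
            total += 1
    return correct / total

# Toy sample (invented): typing SNP 1 alone identifies each haplotype class.
toy = ["AAT", "AAT", "AAT", "GCG", "GCG", "GCG"]
print(leave_one_out_accuracy(toy, typed=[0]))
```

On real data the accuracy would be accumulated for each candidate tag set, which is how the slide compares the block-free selection with the block-based one.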

Premise: Informative SNP selection Select SNPs to use in an association study –Would like to associate single nucleotide polymorphisms (SNPs) with disease. Very large number of SNPs –Chromosome wide studies, whole genome-scans. –For cost effectiveness, select only a subset. Closely spaced SNPs are highly correlated –It is less likely that there has been a recombination between two SNPs if they are close to each other.

SNP selection within blocks Zhang et al. PNAS, Partition chromosome into haplotype blocks. Zhang et al. RECOMB, 2003 H. I. Avi-Itzhak,X. Su, F. M. De La Vega, PSB, 2003 Sebastiani et al. PNAS 2003 Patil et al., PNAS Within blocks one can select the SNPs that maximize entropy or diversity. Zhang et al. AJHG Select a minimal number of SNPs with limited resources.

Block free SNP selection For each SNP define a neighborhood of predictive SNPs. Define a measure of informativeness, how well a set of SNPs predicts a target SNP. Maximize informativeness over all SNPs.

LD Graph Theory: The Definition of Perfect Prediction of a SNP from a set of SNPs. Combinatorial interpretations of intermediate values of D’ and r^2

Distinguishing SNPs [Figure: haplotype matrix highlighting the SNPs that distinguish every pair of haplotypes]

Perfect Distinguishability [Figure: haplotype matrix]

Predictive SNPs [Figure: haplotype matrix in which a set of SNPs predicts SNP s]

Perfect Prediction [Figure: haplotype matrix]

The Informativeness Duality Lemma: Let M be the SNPs/Haps matrix, S the set of SNPs (columns), H the set of haplotypes (rows), and T a subset of S. The following are equivalent: (1) T perfectly predicts every SNP in S; (2) T perfectly distinguishes every pair of distinct haplotypes in H.
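The equivalence in the lemma can be checked exhaustively on a small matrix. The sketch below is only a sanity check under stated assumptions: it uses the three haplotypes GTAG, CTAT, GGTT from the deck's homework example, reads "T perfectly predicts every SNP" as "haplotypes that agree on T agree everywhere", and the function names are invented here.

```python
from itertools import combinations

def predicts_all(haplotypes, T):
    """(1) Haplotypes with the same alleles on T are the same haplotype."""
    seen = {}
    for hap in haplotypes:
        key = tuple(hap[c] for c in T)
        if seen.setdefault(key, hap) != hap:
            return False
    return True

def distinguishes_all(haplotypes, T):
    """(2) Every pair of distinct haplotypes differs at some SNP in T."""
    distinct = set(haplotypes)
    return all(any(a[c] != b[c] for c in T) for a, b in combinations(distinct, 2))

haps = ["GTAG", "CTAT", "GGTT"]
n = len(haps[0])

# Compare both conditions on every subset T of the four SNP columns.
agree = all(predicts_all(haps, T) == distinguishes_all(haps, T)
            for k in range(n + 1) for T in combinations(range(n), k))
print(agree)
```

The two functions fail in exactly the same situation, namely when two different haplotypes look identical on T, which is the content of the lemma.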

Informativeness: Each SNP defines a partition on the set of chromosomes –Infer the value of each SNP in the population. Our goal is to infer the partitions defined by each one of the SNPs. Inferring the partition of every SNP allows us to infer any possible haplotype.
Example haplotypes (one SNP column is marked s in the figure):
1 GGGAT
2 GCTGA
3 ACGAT
4 ACGAT
5 ACTGA
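As a small illustration of the partition idea, the sketch below groups the five example haplotypes by their allele at each SNP. The function name `snp_partition` and the 0-based column indexing are assumptions made for this sketch; haplotypes are numbered 1 to 5 as on the slide.

```python
from collections import defaultdict

def snp_partition(haplotypes, s):
    """Partition haplotype numbers (1-based) by their allele at SNP column s."""
    groups = defaultdict(set)
    for idx, hap in enumerate(haplotypes, start=1):
        groups[hap[s]].add(idx)
    # Order the classes by their smallest member so partitions compare equal.
    return sorted(groups.values(), key=min)

haps = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]

for s in range(5):
    print(s + 1, snp_partition(haps, s))
```

In this sample SNPs 3, 4 and 5 induce the same partition, so typing any one of them recovers the other two.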

Informativeness –For a SNP s and haplotypes I, J, D_s(I,J) is the event that SNP s has different alleles for haplotypes I and J –Define I(s,t) = Pr(D_s(I,J) | D_t(I,J)) –I(s,t) can be estimated from a population sample. For each SNP s, define a bipartite graph on the haplotypes. Let E(s) denote the edge set. [Figure labels: s, I(s,t), t]
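The edge-set view gives a compact way to compute informativeness for a set of SNPs. With pairs drawn uniformly, Pr(D_s | D_t) equals |E(s) ∩ E(t)| / |E(t)|; extending to a set S by taking the union of the E(s) is one natural reading of the slides, stated here as an assumption. The sketch uses the five sample haplotypes shown in the deck and hypothetical function names.

```python
from itertools import combinations

def edge_set(haplotypes, s):
    """E(s): pairs of haplotype indices whose alleles differ at SNP column s."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][s] != haplotypes[j][s]}

def set_informativeness(haplotypes, snps, t):
    """I(S,t): fraction of E(t) covered by the union of E(s) for s in S."""
    e_t = edge_set(haplotypes, t)
    if not e_t:
        return 0.0  # t is monomorphic in the sample
    covered = set().union(*(edge_set(haplotypes, s) for s in snps))
    return len(covered & e_t) / len(e_t)

haps = ["GGGAT", "GCTGA", "ACGAT", "ACGAT", "ACTGA"]

print(set_informativeness(haps, [0], 2))     # SNP 1 alone, target SNP 3
print(set_informativeness(haps, [0, 1], 2))  # SNPs 1 and 2 together, target SNP 3
```

Each E(s) is a complete bipartite graph between the two allele classes of s, which is why the slides speak of a bipartite clique.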

The Minimum Informative SNPs problem: Given a set S of SNPs, compute a minimum informative subset of S. The problem is NP-complete in general –Reduction from set cover. Tractable in practice –When only nearby SNPs are used as candidates

Bounded Width MIS: Only neighboring SNPs inform meaningfully –SNP i can only be used to infer SNP j if there is little evidence of recombination between i and j. I(w,S,t) = informativeness of S w.r.t. t when restricted to SNPs in S that are within the w/2-neighborhood of t. (k,w)-MIS problem: –Given a set T, compute the k most informative SNPs S, i.e. those that maximize I(w,S,T). (k,w)-MIS can be computed in time O(nk2^w) and space O(k2^w).

Correct imputation: block vs. block free [Figure: number of correct imputations vs. number of SNPs typed on the Perlegen dataset; series: Zhang et al. (block-based) and block free]

Correlation of informativeness with imputation in leave-one-out studies [Figure: informativeness and leave-one-out results vs. number of SNPs for block-free selection on the Perlegen dataset]

Haplotype blocks

Haplotype Blocks

Union of possible haplotype blocks

Block free – SNPs selected

Haplotype block tagging SNPs

Homework: Haplotypes over SNPs 1–4: G T A G, C T A T, G G T T. Find the minimum subset of SNPs that needs to be typed, i.e., from which the rest of the SNPs can be predicted.

Answer: Solution 1 = Type SNPs 1 and 3 (haplotypes: G T A G, C T A T, G G T T). From SNPs 1 and 3 we can predict SNP 4. From SNP 3 we can predict SNP 2. Another solution (maybe better for Mercury SNPs : ) Solution 2 = Type SNPs 1 and 2.
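For a matrix this small the homework can be solved by brute force. The sketch below enumerates SNP subsets by increasing size and keeps those from which every SNP can be predicted. It assumes the twelve letters are the three haplotypes GTAG, CTAT, GGTT, uses 0-based column indices, and is an exhaustive search for illustration only, not one of the algorithms from the deck.

```python
from itertools import combinations

def determines(haplotypes, T, target):
    """True if the alleles on columns T fix the allele at the target column."""
    seen = {}
    for hap in haplotypes:
        key = tuple(hap[c] for c in T)
        if seen.setdefault(key, hap[target]) != hap[target]:
            return False
    return True

def minimum_tag_sets(haplotypes):
    """All smallest column subsets from which every SNP can be predicted."""
    n = len(haplotypes[0])
    for k in range(n + 1):
        hits = [T for T in combinations(range(n), k)
                if all(determines(haplotypes, T, t) for t in range(n))]
        if hits:
            return hits
    return []

haps = ["GTAG", "CTAT", "GGTT"]
print(minimum_tag_sets(haps))
# → [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```

In 1-based numbering the result includes {1, 3} and {1, 2}, the two solutions given on the slide, along with three other pairs that work equally well on this sample.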

Informativeness of a SNP: The informativeness of a SNP s with respect to a SNP t quantifies the confidence with which we can predict t from s. Let s be a SNP and i, j be haplotypes. Let D(s,i,j) be the event that haplotypes i and j have different alleles at s. The informativeness of s w.r.t. t is given by I(s,t) = Prob[ D(s,i,j) | D(t,i,j) ], where i and j are haplotypes drawn uniformly at random from the set of all distinct haplotype pairs.

The Min Informative Subset Problems
Observe that: I(s,t) = 1 implies perfect prediction; I(s,t) = 0 implies no predictability.
The Minimum Perfectly Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0 < k <= n
Output: Does there exist a subset S’ of S-T such that I(S’,T) = 1 and size of S’ <= k?
The k-Most Informative Subset of SNPs Problem
Input: A set of n SNPs S, a subset T of S, and 0 < k <= n
Output: Find a subset S’ of S-T with size of S’ <= k such that I(S’,T) = MAX {I(S”,T)} over all subsets S” of S-T with size of S” <= k.

Basic Insight: The Set Cover Problem. The Minimum Perfectly Informative Subset of SNPs Problem is NP-complete. The k-Most Informative Subset of SNPs Problem is NP-complete.

Graph Theory – Min Set Cover [Figure: bipartite graph between sets (GIRLS) and elements (BOYS)] Want: the minimum number of sets that cover all elements, or the minimum number of GIRLS that know all the BOYS.

Our Boys and Girls … The elements: for a SNP t, the elements are the pairs of haplotypes that are distinguished by t. The sets: each SNP s defines a set consisting of all pairs of haplotypes that are distinguished by both s and t. The minimum set cover is the minimum subset of SNPs that perfectly predicts the entire sample.
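The set cover formulation can be made concrete with the standard greedy heuristic, which repeatedly picks the SNP covering the most still-uncovered haplotype pairs. This greedy rule is a textbook approximation swapped in for illustration; it is not claimed to be the method used in the deck, and it does not guarantee a minimum cover. The names and the choice of the homework haplotypes are assumptions of this sketch.

```python
from itertools import combinations

def edge_set(haplotypes, s):
    """Pairs of haplotype indices distinguished by SNP column s."""
    return {(i, j) for i, j in combinations(range(len(haplotypes)), 2)
            if haplotypes[i][s] != haplotypes[j][s]}

def greedy_cover(haplotypes, t, candidates):
    """Greedily choose SNPs whose edge sets cover every pair distinguished by t."""
    uncovered = edge_set(haplotypes, t)   # the elements
    chosen = []
    while uncovered:
        # The set for SNP s is E(s) restricted to the pairs distinguished by t.
        best = max(candidates, key=lambda s: len(edge_set(haplotypes, s) & uncovered))
        gain = edge_set(haplotypes, best) & uncovered
        if not gain:
            return None  # the candidates cannot perfectly predict t
        chosen.append(best)
        uncovered -= gain
    return chosen

haps = ["GTAG", "CTAT", "GGTT"]
print(greedy_cover(haps, t=3, candidates=[0, 1, 2]))
```

Covering every pair distinguished by t means that haplotypes agreeing on the chosen SNPs also agree on t, which is the perfect-prediction condition.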

Algorithms. Algorithm 1: When S is a set of SNPs in perfect LD with each other (i.e., all in a no-4-gamete block), the k-Most Informative Subset of SNPs problem can be solved exactly in O(nm) time, where n is the number of SNPs and m the number of haplotypes. Algorithm 2: When the distance in SNPs between the predicting SNP(s) and the target SNP is at most w, the (k,w)-Most Informative Subset of SNPs problem can be solved exactly in time O(nk2^w) and space O(k2^w).

Block free SNP selection