# SNP Haplotype reconstruction Statistics 246, 2002, Week 14, Lecture 2 Not complete.

## Presentation on theme: "SNP Haplotype reconstruction Statistics 246, 2002, Week 14, Lecture 2 Not complete."— Presentation transcript:

SNP Haplotype reconstruction Statistics 246, 2002, Week 14, Lecture 2 Not complete

The problem We start with a collection of genotypes in the form of allelic determinations at tightly linked single nucleotide polymorphisms (SNPs) for each of a set of n individuals. For example, we might describe 3 SNPs as follows: Name SNP alleles (major, minor) SNP1 T, A SNP2 A, G SNP3 C, G An individual might have genotype AT at SNP1, AA at SNP2, and CG at SNP3, which we will denote by AT//AA//CG. Possible haplotype pairs for this person are AAC/TAG and AAG/TAC, and without further information, we can’t distinguish between these two pairs.

The problem, cont. What can be done? With information on the individual’s parents, we can usually infer the haplotypes, the only problem being that the parents may not be fully informative. For example, if the maternal and paternal genotypes were TA//AA//CC and TT//AA//CG respectively, at SNPs 1, 2 and 3, and the individual is AT//AA//CG, then it would be clear that the haplotypes were AAC/TAG (why?). On the other hand, if the parents both had genotypes AT//AA//CG, then we wouldn’t be able to determine unique haplotypes for the individual. Even in the first case, we might have to make an assumption about the frequency of recombination: what is it? Our problem here is to determine haplotypes, or make good guesses at them without parental genotypes.

Origin of the problem Why do we want to determine haplotypes for individuals at tightly linked SNP loci? From the class: a)Haplotypes are more powerful discriminators between cases and controls in disease association studies. Why? b) With haplotypes we can conduct evolutionary studies. c) Use of haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing. Is this the same point as a)? d) (From me) Haplotypes are necessary in linkage analyses. Other reasons?

Two aspects of the problem With a random sample of multilocus genotypes at a set of SNPs, we can attempt a) to estimate the frequencies of all possible haplotypes, and b) to infer the haplotypes of all individuals. The first step on this problem was taken by A Clark in 1990. He gave what we might call a parsimony solution to b) above. It goes like this. With a reasonable sample size, we might expect to have some individuals homozygous at every locus, e.g. TT//AA//CC, or heterozygous at just one locus, e.g. TT//AA//CG. With the individuals of former type, we have unambiguously identified one (TAC), and of the latter type two (TAC and TAG) haplotypes present in the population. The algorithm begins by finding all homozygotes and single SNP heterozygotes and tallying the resulting known haplotypes.

Now proceed as follows. For each known haplotype, look at all remaining unresolved cases, and ask whether the known haplotype can be made from some combination of ambiguous sites from an unresolved case. For example, if we have identified TAC as a known haplotype from a TT//AA//CC homozygote, and we have an individual AT//AA//CG still unresolved, then we infer that s/he is TAC/AAG, and we have have “resolved” this person’s haplotype and added a putative haplotype to our list. Similarly, a TT//AA//CG individual gives us both TAC and TAG as known haplotypes, and both of these go into the initial list. This chain of inferences is continued until either all haplotypes have been recovered, or until no more new haplotypes can be found in this way. A Clark’s algorithm, cont.

Clark’s algorithm with SNPs in practice This method should work in principle, but there are three problems that might arise in practice: a) there may be no homozygotes or single SNP heterozygotes in the sample, and so the chain might never get started; b) there may be many unresolved haplotypes left at the end; and c) haplotypes might be erroneously inferred if a crossover product of two actual haplotypes is identical to another true haplotype. The frequency of these problems will depend on average heterozygosity of the SNPs, number of loci, their recombination rates and the sample size. Clark (1990) did some calculations and simulations which led him to believe the algorithm would perform well, even with relatively small sample sizes. And it did.

The EM algorithm solution to problem a) We now describe an EM algorithm to infer haplotype frequencies in a population on the basis of a random sample and the assumption of random mating for haplotypes. Escoffier and Slatkin (1995) call the phase unknown multilocus genotypes, e.g. TT//AA//CG, phenotypes, and keep the term genotype for the corresponding haplotype pair TAC/TAG. Others use the term diplotype for a pair of haplotypes, but neither of these has caught on. The observed data in a random sample of n individuals will be multilocus genotype frequencies, and the natural model is multinomial. The number c of haplotype pairs leading to a given multilocus genotype will depend on the number s of heterozygous SNPs, and will be 2 s-1. E.g if our genotype is TT//AA//CG, then we can recover the haplotypes unambiguously (c=1), but for our original case, AT//AA//CG, there were c = 2 possible haplotype pairs. Under the assumption of random mating, the probability of a given genotype is just the sum of 2 s-1 squares or products of haplotype probabilities, e.g. pr(AT//AA//CG) = pr(AAC/TAG) + pr(AAG/TAC) = pr(AAC)pr(TAG) + pr(AAG)pr(TAC).

The EM solution, cont. Straightforward, history, performance in recovering frequencies, and in recovering true haplotypes.

The EM algorithm: performance with SNPs Summarize Fallin & Schork (2000)

A Gibbs sampler approach to haplotype reconstruction, Stephens et al (2001) Our notation is as follows: G = (G 1,…, G n ) denotes the observed multilocus genotype frequencies, H=(H 1,…,H n ) will be the corresponding unknown haplotype pairs, while F=(F 1, …, F M ) will denote the M unknown population haplotype frequencies. In these terms, the EM algorithm sought that F which maximized pr(G|F). The Gibbs sampler uses the following three steps, starting from some initial haplotype reconstruction H (0) : i) Choose an individual i, uniformly and at random from all ambiguous individuals; ii) Sample H i (t+1) from pr(H i | g, H -i (t) ), where H -i is the set of haplotypes excluding individual i; iii) Set H j (t+1) = H j (t) for j=1,…,i-1,i+1, …,n. General theory tells us that this produces a Markov chain with the desired stationary distribution. The question is: what is pr(H i | g, H -i (t) )?

The Stephens et al Gibbs sampler, some details For any haplotype pair H i = (h i1,h i2 ) consistent with genotype G i, write pr(H i | G, H -i )  pr(H i | H -i )  pr(h i1 |H -I )pr(h i2 | H -I, h i1 ) (*) where pr(.|H) is the conditional distribution of a future sampled haplotype, given a set H of previously sampled haplotypes. There are a couple of choices here, one assuming that the type of a mutant offspring is h with probability h, independent of the type of the parent. In that case, for a constant-size random mating population, pr(h|H) = (r h +  h )/(r +  ), where r h is the number of haplotypes of type h in H, r is the total number of haplotypes in H, and  is the scaled mutation rate.

The Stephens et al Gibbs sampler, more details In principle we can substitute this formula into step ii) of the generic Gibbs algorithm, and off we go. In practice, the number of possible values of Hi is too large: 2 s-1, where s is the number of loci at which individual i heterozygous. However, if we take h =1/M, where M is the total number of different possible haplotypes that could be observed in the population, we can exploit the fact Incomplete.

The Stephens et al Gibbs sampler, yet more details Stephens et al (2001) offer a second algorithm, based on slightly different population genetic modelling. It takes the form pr(h | H) = ∑ {  E} ∑ {s ≥0} [r  /r][  /(r+  )] s [r/(r+  )](P s )  h, where r  is the number of haplotypes of type  in the set H, r is the total number of haplotypes in H, and  is the scaled mutation rate. Here E is the set of types for a general mutation model, and P a reversible mutation matrix. Informally, this corresponds to the next sampled haplotype h being obtained by applying a random number of mutations s to a randomly chosen existing haplotype , where s is sampled from a geometric distribution. This related to sampling from what is known as the coalescent, see Stephens et al (2001) for details and references. The algorithm uses the above expression in (*) a few slides back. There are several issues that need to be dealt with here: estimation of , dealing with the fact that the dimension of P is M, the number of possible haplotypes, etc.

An alternative Gibbs sampler, Nui et al (2002) These authors are from the Bayesian school, who assume Dirichlet priors with parameters  = (  1, …,  M ), for the haplotype frequencies F = (f 1, …, f M ). As with the EM approach, our model is multinomial, being a product over individuals in the sample, of phase unknown multilocus genotype probabilities, which in turn are sums of products of pairs of haplotype probabilities. The problem, as always, is in the large number of haplotypes which are compatible with a given genotype. More fully, thinking of the haplotypes H as “missing”, we can write pr(G,H | F)  ∏ {i=1,..n} f hi1 f hi2 ∏ j f j  j - 1. The steps are familiar: conditional on F, sample h i1 and h i2 for individual i according to (**) below; then sample F given H updated. (**) pr(h i1 =g, h i2 = h|F, G i ) = f g f h /∑ {g’  h’=Gi} f g’ f h’. Here g’  h’=G i means that g’ and h’ combine to give G i.

Alternative Gibbs sampler, cont. The advantage of this approach is that it is independent of complex and potentially dangerous population genetic modelling assumptions. Furthermore, tricks we have met before and some new ones can be used to improve the efficiency of the algorithm. Predictive updating. Here, as elsewhere, we integrate out the parameters F to improve the Gibbs sampler. In this case pr(G,H)   [|  +N(H)|]/  [  +N(H)], where we are using the abbreviated gamma function notation once more, and where N(H) is the vector of haplotype counts. Now we can use a different, easier sampler: pick an individual i at random or in a certain order, and update his/her haplotype h i by sampling from pr(h i = (g,h)|H -i,G)  (n g +  g )(n h +  h ), where n g and n h are counts of g and h in H -i, respectively.

Alternative sampler, cont: Ligation. Handling a large number of loci and hence haplotypes still presents a challenge. It is natural to adopt a divide and conquer approach: solve the problem for small blocks of loci, and then piece together the solutions. Niu et al suggest partitioning L loci into blocks of about 8 loci. They offer two strategies: progressive ligation and hierarchical ligation. In both cases, the first step is to carry out block level haplotype reconstruction using the sampler just described. They then record the B most probable (partial) haplotypes for each block, between 40 and 50 in their examples, and proceed to join them. Progressive ligation begins at one end and pieces consecutive pairs together, using the Gibbs sampler restricted to longer haplotypes which are among the B 2 combinations from the two most probable sets of B partial haplotypes. This process is then continued until all the blocks are joined. Hierarchical ligation is analogous, but working across the whole length, see Figure on next slide. It is worth pointing out that these strategies not only deal with the large state space, they also help the Markov Chain to converge more rapidly.

Schematic depicting ligation. Here there are L loci, initially in segments of K, while  is the highest level of the pyramidal hierarchy.

Alternative sampler completed Prior annealing. A nice trick used in Niu et al to enable the Gibbs sampler to more move freely in haplotype space is to use high pseudo-counts at the beginning of the iterations, and progressively reduce them at a fixed rate as the sampler continues. Their formula for T iterations is  (t) =  (0) + t(  (T) -  (0) )/T. Missing marker data. The absence of both alleles of an SNP marker is common owing to PCR dropouts. Another concern is the “one-allele” problem, in which only one allele is unscored. The Gibbs sampler adapts nicely to having multiple categories of errors and dealing with these in the sampling.

References A G Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7: 111-122. 1990 L Escoffier and M Slatkin. Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921-927, 1995. D Fallin and NJ Schork. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation maximization algorithm, for unphased diploid genotype data. Am J Hum, Genet. 67:947-959, 2000. M Stephens, NJ Smith and P Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978-989, 2001. T Niu, ZS Qin, X. Xu and J Liu. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70:157-169, 2002.

Download ppt "SNP Haplotype reconstruction Statistics 246, 2002, Week 14, Lecture 2 Not complete."

Similar presentations