1 Environmental and heritable factors in the causation of cancer. The genetic epidemiology of cancer: Interpreting family and twin studies. Week 4, Stat 246, 2002. Background to and discussion of: Lichtenstein et al, NEJM 343, 2000: 78-84, and Risch, Cancer Epi., Biom. & Prev. 10, 2001: 733-741.
3 The papers in brief Lichtenstein et al (2000). Combined data on 44,788 pairs of twins listed in the Swedish, Danish and Finnish twin registries in order to assess the risks of cancer at 28 anatomical sites for the twins of persons with cancer. Statistical modeling was used to estimate the relative importance of heritable and environmental factors in causing cancer at 11 of those sites. Risch (2001). Offers a reassessment of the role of genetic factors in cancer susceptibility generally and for site-specific cancers in particular. Presents a detailed critique of Lichtenstein et al (2000).
4 Summary of conclusions Lichtenstein et al. “Inherited genetic factors make a minor contribution to susceptibility to most types of cancers. This finding indicates that the environment has the principal role in causing sporadic cancers.” Risch. “a) All cancers are familial to approximately the same degree, with only a few exceptions; b) early age of diagnosis is generally associated with increased familiality; c) familiality does not decrease with decreasing prevalence of the tumor - in fact the trend is toward increasing familiality with decreasing prevalence; d) a multifactorial (polygenic) threshold model fits the twin data for most cancers less well than single gene or genetic heterogeneity-type models; e) recessive inheritance is less likely generally than dominant or additive models; f) heritability decreases for rarer tumors only in the context of the polygenic model but not in the context of single-locus or heterogeneity models; g) although the family and twin data do not account for gene-environment interaction or confounding, they are still consistent with genes contributing high attributable risks for most cancer sites.”
5 Setting the scene, I Lichtenstein et al use the multifactorial (polygenic) threshold (MFT) model, and infer the relative contributions of heredity and environment within that model. Their analysis rests on “the usual assumptions of a classic twin study (that there was random mating, no interactions between genes and environment, and equivalent environments for monozygotic and dizygotic twins). Phenotypic variance was divided into a component due to inherited genetic factors (heritability), a component due to environmental factors common to both members of the pair of twins (the shared environmental component), and a component due to environmental factors unique to each twin (the nonshared environmental component)”.
6 Setting the scene, II By contrast, Risch makes extensive use of familial risk ratios (FRRs). These are quantities denoted by $\lambda_R$, where R denotes a relationship (S = sib, O = offspring, DZ = dizygotic twin, etc.), and whose values are the risks of relatives of type R of affected individuals being themselves affected (here by cancer), divided by the population prevalence. A way to think of $\lambda_R$ is as pr(affected | R affected)/pr(affected): the probability (risk) of someone being affected, given that their relative of type R is affected, divided by the unconditional probability of that person being affected. In this view, it is entirely analogous to the coincidence coefficient we met in the study of interference. If we denote the population prevalence of our trait by $K$, and the frequency of affected pairs with relationship R by $K_2$, then $\lambda_R = K_2 / K^2$.
7 More on familial risk ratios They can be estimated directly from family data, and can also be studied theoretically, by calculating them under different assumptions concerning penetrances and susceptibility allele frequencies. In particular, we can estimate $\lambda_R$, where R = MZ and R = DZ, from twin data, and also study the behaviour of these quantities under different genetic models, e.g. a single rare dominant gene causing susceptibility, or a recessive gene, whose susceptibility allele frequency ranges from very common to very rare.
8 Points to consider when comparing two types of models for the involvement of genes in disease susceptibility How do the models relate to our current understanding of genetics in general, and that of disease susceptibility in particular? How interpretable are the models’ parameters? How do the models relate to available data? Do they fit? Does their qualitative behaviour reflect broadly observed trends?
9 Single factor models Suppose that the levels of a quantitative trait in a population are influenced by a gene at a locus at which two (unobserved) alleles A and a are segregating. Suppose further (with Weinberg and Hardy) that the population frequencies of types AA, Aa and aa are $p^2$, $2pq$ and $q^2$, respectively, where $0 < p < 1$ and $q = 1 - p$.
10 One-factor models: the joint distribution of parent-offspring genotypes
Parental genotype (rows) by offspring genotype (columns):
          AA        Aa        aa
AA        p^3       p^2 q     0
Aa        p^2 q     pq        pq^2
aa        0         pq^2      q^3
Exercise: Obtain the above table and the corresponding table for full sibs.
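As a check on the table, here is a minimal Python sketch that builds the joint parent-offspring distribution from Hardy-Weinberg frequencies and random mating. The function name `parent_offspring_joint` and the value p = 0.3 are illustrative assumptions, not part of the lecture.

```python
def parent_offspring_joint(p):
    """Joint distribution of (parent, offspring) genotypes.

    Assumes Hardy-Weinberg proportions and random mating.
    Genotype order is AA, Aa, aa; p is the frequency of allele A.
    """
    q = 1 - p
    parent_freq = [p * p, 2 * p * q, q * q]
    # Probability that this parent transmits allele A, by parental genotype.
    transmit_A = [1.0, 0.5, 0.0]
    joint = []
    for freq, t in zip(parent_freq, transmit_A):
        # The other parent contributes A with probability p (random mating).
        child = [t * p, t * q + (1 - t) * p, (1 - t) * q]
        joint.append([freq * c for c in child])
    return joint


# Illustrative allele frequency (an assumption, chosen only for the demo).
p = 0.3
for row in parent_offspring_joint(p):
    print(["%.4f" % x for x in row])
```

The same conditioning argument, applied to pairs of parents rather than a single parent, gives the full-sib table asked for in the exercise.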
11 One-factor models: parent-offspring correlation Now suppose that deviations from the population mean of the trait for individuals with genotypes AA, Aa and aa are $u$, $v$ and $w$, respectively. With Yule (1906) we put $p = 1/2$. Then the correlation between the trait values of P and O is $\mathrm{corr}(P,O) = (u-w)^2 / [2(u-w)^2 + (u-2v+w)^2]$. Additivity: $u + w = 2v$, $\mathrm{corr}(P,O) = 1/2$. Dominance: $v = w$, $\mathrm{corr}(P,O) = 1/3$. This calculation reconciled the Mendelian and Biometric schools at the turn of the last century. Exercise: Derive the above and obtain corr(S,S’) for full sibs S, S’ in the same way. (Remember that $u + 2v + w = 0$. Why?) Redo with general allele frequencies.
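The correlation formula can be checked numerically. The sketch below uses exact fractions and the parent-offspring joint table at p = q = 1/2; the function names are hypothetical, and the code centres the genotype values itself, so it does not rely on u + 2v + w = 0.

```python
from fractions import Fraction


def parent_offspring_corr(u, v, w):
    """corr(P, O) at p = 1/2 for genotype values u (AA), v (Aa), w (aa)."""
    values = [Fraction(u), Fraction(v), Fraction(w)]
    marginal = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
    e = Fraction(1, 8)
    # Joint parent-offspring table with p = q = 1/2.
    joint = [[e, e, 0], [e, 2 * e, e], [0, e, e]]
    mean = sum(m * x for m, x in zip(marginal, values))
    var = sum(m * (x - mean) ** 2 for m, x in zip(marginal, values))
    cov = sum(
        joint[i][j] * (values[i] - mean) * (values[j] - mean)
        for i in range(3)
        for j in range(3)
    )
    return cov / var


def closed_form(u, v, w):
    """The closed-form expression quoted on the slide."""
    u, v, w = Fraction(u), Fraction(v), Fraction(w)
    return (u - w) ** 2 / (2 * (u - w) ** 2 + (u - 2 * v + w) ** 2)


print(parent_offspring_corr(1, 0, -1))   # additive case: u + w = 2v
print(parent_offspring_corr(3, -1, -1))  # dominance case: v = w
```

Both functions should agree for any choice of u, v, w with u different from w, which is a useful sanity check before attempting the general-allele-frequency exercise.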
12 Multifactorial (polygenic) models In general, we can postulate arbitrarily many segregating loci, with arbitrary numbers of alleles, and arbitrary allele frequencies, though still satisfying Hardy-Weinberg equilibrium, each locus contributing independently to the quantitative trait, and repeat the analysis just presented. This was done by RA Fisher in a 1918 paper which laid out the framework that underlies the MFT model. One result is a number of formulae like the following: $\mathrm{cov}(\text{MZ twins}) = V_A + V_D$, $\mathrm{cov}(\text{DZ twins}) = 0.5\,V_A + 0.25\,V_D$, where $V_A$ and $V_D$ are called the additive and dominance variances, respectively. The total variance of the quantitative trait in this context is $V = V_A + V_D + V_E$, where $V_E$ is called the environmental variance. An extra twist in twin studies is a further term $V_S$ for shared family environment effects, which are postulated to be uncorrelated with the general environmental effects contributing to $V_E$. In what follows, we suppose that $V_D = 0$, and adopt the terminology that calls $H = V_A / V$ the heritability of the quantitative trait.
13 Threshold models for disease susceptibility The analysis of RA Fisher was all for quantitative traits. But most diseases are qualitative traits: you get them, or you don’t. People wanted to use the preceding framework for disease traits, and they did it by postulating an unobserved liability to which the foregoing genetic analysis applies. They then suppose an individual becomes affected when their liability exceeds a threshold $T$. In practice, the population distribution of liabilities is always assumed to be normal $N(0,1)$, and $T$ is determined by the population prevalence $K$, e.g. if $K = 1\%$, then $T = 2.33$. When pairs of individuals are being considered, the joint distribution of their liabilities is always assumed to be bivariate normal $BN(0,0;\,1,1;\,\rho)$, where $\rho$ is the appropriate multiple (1.0 for MZ, 0.5 for DZ twins) of the additive variance, which here is just the heritability, as the total variance is 1. Thus from a knowledge of the prevalence and the heritability of the liability of a disease trait, and the relationship between two individuals, we could calculate the chance under the MFT model that neither, one and not the other, or both will be affected (see later). In practice, the reverse is done: from the frequencies of such concordant and discordant relative pairs, people can estimate heritability under the MFT model, and this is exactly what Lichtenstein et al (2000) did.
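The forward calculation described here (prevalence and heritability in, pair probabilities out) can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not the fitting procedure of Lichtenstein et al: the function names are hypothetical, the bivariate normal integral is approximated with Simpson's rule, and the heritability value 0.5 is arbitrary.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # liability distribution N(0, 1)


def threshold(K):
    """Threshold T such that P(liability > T) = K."""
    return STD_NORMAL.inv_cdf(1 - K)


def prob_both_affected(K, rho, steps=2000):
    """P(X > T, Y > T) for a standard bivariate normal, |rho| < 1.

    Integrates phi(x) * P(Y > T | X = x) over [T, T + 10] with
    Simpson's rule (steps must be even); the tail beyond is negligible.
    """
    T = threshold(K)
    s = sqrt(1 - rho * rho)

    def f(x):
        return STD_NORMAL.pdf(x) * (1 - STD_NORMAL.cdf((T - rho * x) / s))

    h = 10.0 / steps
    total = f(T) + f(T + 10.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(T + i * h)
    return total * h / 3


def pair_probabilities(K, rho):
    """Return P(neither), P(exactly one), P(both) affected."""
    both = prob_both_affected(K, rho)
    one = 2 * (K - both)
    return 1 - both - one, one, both


K, H = 0.01, 0.5  # H = 0.5 is an arbitrary illustrative heritability
print("T =", round(threshold(K), 2))
print("MZ:", pair_probabilities(K, 1.0 * H))
print("DZ:", pair_probabilities(K, 0.5 * H))
```

Estimating heritability from observed concordant and discordant counts amounts to inverting this map, typically by maximum likelihood.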
14 MFT models for disease susceptibility Now let’s get to the MFT model used in their analysis by Lichtenstein et al. As you see from p. 81 of their paper, their data for a particular type of cancer, say stomach cancer, come in the form of counts for male and female MZ and DZ twin pairs, namely a concordant affected pairs, b and c each half the number of discordant pairs (one affected, one unaffected), and d = n-a-b-c concordant unaffected pairs. They find it convenient to form a 2 × 2 table with entries a, b, c and d. These counts were also stratified by country in their analysis, but presented in aggregated form in the paper.
15 Lichtenstein et al’s analysis of 12 × 11 sets of 2 × 2 tables For each country (3), sex (2) and zygosity (2), and each of 11 cancer sites or types, L et al have a 2 × 2 table of counts:
        +     -
  +     a     b
  -     c     d
Here a is the number of concordant affected twin pairs, etc. Suppose that this table corresponds to DZ twins, say, from a country where the liability threshold for affectedness is $T$, and the components of variance in the bivariate normal model for the liability are $V_A$ (additive), $V_S$ (shared environment), and $V_E$ (nonshared environment), all adding to 1. The correlation for DZ twins’ liabilities is $\rho = 0.5\,V_A + V_S$, with 1.0 instead of 0.5 for MZ twins. Their joint analysis of the 12 2 × 2 tables for a given site or type assumes country-dependent thresholds, but common, sex-specific variance components.
16 Summary data and results for stomach cancer
Twin pairs    a    b+c    RR      CI             Concordance
Men MZ        6    131    9.9     (4.1-23.6)     0.08
Men DZ        8    256    6.6     (3.2-13.8)     0.06
Women MZ      5     92    19.7    (7.5-51.6)     0.10
Women DZ      4    198    6.2     (2.2-17.1)     0.04
$V_A$ = 0.28 (0.00-0.51) (not sig. diff. from 0); $V_S$ = 0.10 (0.00-0.34); $V_E$ = 0.62 (0.49-0.76). Goodness-of-fit $\chi^2$ = 8.9 on 38 d.f.
17 Comment on relative risks L et al calculate a conventional odds ratio, which they term relative risk, R = ad/bc from the 2 × 2 table entries, plus a confidence interval for R, as well as a twin concordance 2a/(2a+b+c) for the absolute risk. The second makes sense, but the first doesn’t, as the division of discordants into two equal-sized groups to form the 2 × 2 table is artificial. They should be using Risch’s $\lambda_{DZ}$ and $\lambda_{MZ}$; see later.
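Both summary quantities are easy to recompute from the stomach cancer counts on the previous slide. In the sketch below the function names are hypothetical, and the total n = 7,231 for male MZ pairs is taken from the twin relative risk slide later in the deck; totals for the other three groups are not given in these notes, so only the concordance is computed for them.

```python
def concordance(a, discordant):
    """Twin concordance 2a / (2a + b + c)."""
    return 2 * a / (2 * a + discordant)


def odds_ratio(a, discordant, n):
    """Odds ratio ad / bc with the discordant pairs split evenly (b = c)."""
    b = c = discordant / 2
    d = n - a - discordant
    return a * d / (b * c)


# (a, b + c) for stomach cancer, from the summary table.
stomach = {
    "Men MZ": (6, 131),
    "Men DZ": (8, 256),
    "Women MZ": (5, 92),
    "Women DZ": (4, 198),
}

for group, (a, disc) in stomach.items():
    print(group, round(concordance(a, disc), 2))

# n for male MZ pairs is quoted later in the deck.
print("Men MZ odds ratio:", round(odds_ratio(6, 131, 7231), 1))
```

The even split b = c inside `odds_ratio` is exactly the artificial step the slide objects to; the code makes it explicit rather than justifying it.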
18 Comment on the model fitting and residual degrees of freedom What should the residual d.f. be? I count 12 = 3 (countries) × 2 (sexes) × 2 (twin types) sets of either 3 (my view) or 4 numbers: a = # concordant affected, b+c (or b, c) discordants, and d = n-a-b-c concordant unaffecteds. I say 3 here, as there is no real way to split the discordants, since twins are unlabelled. So I get 36 numbers, actually only 24 freely varying ones, since one of each triple is determined by the other 2 and n. The other calculation gives 48, but really only 36. How many parameters are estimated? I count 3 thresholds, and either 2 or 4 freely varying variance components (say additive and shared environment); the remainder comes by subtracting the first two from 1. The number 2 is right if we have pooled across sexes, as Table 3 suggests, otherwise 4 for sex-specific analyses. However, not one of 24, 36 or 48 minus 3+2 or 3+4 gives 38! Any ideas?
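The bookkeeping can be laid out mechanically. This short sketch simply enumerates the candidate data counts and parameter counts named above (the variable names are illustrative) and lists every residual d.f. they can produce.

```python
# Candidate numbers of data points discussed on the slide.
data_counts = [24, 36, 48]
# 3 thresholds plus 2 (pooled) or 4 (sex-specific) free variance components.
param_counts = [3 + 2, 3 + 4]

residual_df = sorted({n - k for n in data_counts for k in param_counts})
print(residual_df)
print("38 among them?", 38 in residual_df)
```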
19 Some comments from the paper “The statistical model we used provided an excellent fit to the observed data.” “Although the model fitting can be used to estimate the magnitude of the heritable component of susceptibility to cancer, it cannot reveal how this component acts or how it interacts with other factors.” “…we cannot exclude a modifying effect of environment on the genetic component found in our analyses of twins.”
20 A cautionary remark “…the study of twins, from being regarded as one of the easiest and most reliable kinds of researches in human genetics, must now be regarded as one of the most treacherous.” From L S Penrose, Outline of human genetics Heinemann, 1959, quoted in Pak Sham, Statistics in human genetics Arnold, 1998.
21 Additional references Michael C Neale and Lon R Cardon Methodology for genetic studies of twins and families Kluwer, 1992. A very comprehensive treatise on the methods of Lichtenstein et al (2000). Arthur S. GOLDBERGER and Leon J. KAMIN Behavior-Genetic Modeling of Twins: A Deconstruction SSRI Working Paper #9824 University of Wisconsin, 1998. http://www.ssc.wisc.edu/econ/archive/wp9824.htm As it says, a deconstruction.
22 Risch’s critique: a beginning First, some facts about the familial risk ratios. If genetic susceptibility is attributable to a single (rare) dominant gene, then $\lambda_P = \lambda_O = \lambda_S = \lambda_{DZ} = (\lambda_{MZ} + 1)/2$, and so $R_{MD} = (\lambda_{MZ} - 1)/(\lambda_{DZ} - 1) = 2$. If susceptibility is attributable to a recessive gene, then $\lambda_P = \lambda_O < \lambda_S = \lambda_{DZ}$ and $R_{MD} > 2$, depending on the allele frequency. For a rare allele, $R_{MD} = 4$, but it diminishes toward 2 if the allele is very common. (Exercise: Derive these facts.)
23 Proofs of results on previous page As in the calculation on p.10, we need to enumerate all 9 mating-pair types, and their population frequencies, and then calculate the chance that they have 2 affected sibs, under a single rare dominant gene model. Then we form $R_{MD}$. The algebra is ugly, apart from the quantity $K = p^2 + 2p(1-p)$. Clearly (why?) $\lambda_{MZ} = 1/K$, while $K_2$ for DZ twins (the numerator of $\lambda_{DZ}$) is a quartic polynomial in $p$. If we go directly to $R_{MD} = (\lambda_{MZ} - 1)/(\lambda_{DZ} - 1)$, we see that it is $K(1-K)/(K_2 - K^2)$. The numerator and denominator are both quartics in $p$, but we only need the lowest-order terms, which are $2p$ and $p$, respectively, so that for small $p$ their ratio is 2 (Why?). Exercise: Obtain the second result cited on p.22.
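The enumeration over the 9 mating-pair types can also be done numerically, which gives a check on the limiting values of $R_{MD}$. The sketch below is an illustration under assumptions: the function names are hypothetical, penetrances are taken to be 0 or 1 (full penetrance, no phenocopies), and p is the frequency of allele A with genotypes ordered AA, Aa, aa.

```python
from itertools import product


def offspring_dist(gm, gf):
    """Offspring genotype distribution (AA, Aa, aa) for a mating pair."""
    transmit_A = [1.0, 0.5, 0.0]  # P(transmit A) for AA, Aa, aa
    a, b = transmit_A[gm], transmit_A[gf]
    return [a * b, a * (1 - b) + (1 - a) * b, (1 - a) * (1 - b)]


def risk_ratios(p, penetrance):
    """Return (lambda_MZ, lambda_DZ, R_MD) for a single diallelic locus."""
    q = 1 - p
    freqs = [p * p, 2 * p * q, q * q]  # Hardy-Weinberg proportions
    K = sum(f * x for f, x in zip(freqs, penetrance))
    # MZ twins share a genotype, so both are affected with probability x^2.
    K2_mz = sum(f * x * x for f, x in zip(freqs, penetrance))
    # DZ twins (sibs) are independent given the parental mating type.
    K2_dz = 0.0
    for gm, gf in product(range(3), repeat=2):
        risk = sum(c * x for c, x in zip(offspring_dist(gm, gf), penetrance))
        K2_dz += freqs[gm] * freqs[gf] * risk ** 2
    lam_mz, lam_dz = K2_mz / K ** 2, K2_dz / K ** 2
    return lam_mz, lam_dz, (lam_mz - 1) / (lam_dz - 1)


print("rare dominant   :", risk_ratios(0.001, [1, 1, 0]))
print("rare recessive  :", risk_ratios(0.001, [1, 0, 0]))
print("common recessive:", risk_ratios(0.99, [1, 0, 0]))
```

Passing intermediate penetrance values instead of 0 and 1 allows the same function to be used for incompletely penetrant models.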
24 Penetrance functions Recall that penetrance is the chance of being affected, given genotype, and can be written $x_i$ for genotype $i$. Various new penetrance functions can be built up from single locus penetrances. Examples include: with phenocopies: $x'_i = \min(1, x_i + \delta)$, where $\delta$ is the added phenocopy risk; multi-locus models (here $i, j, k, \ldots$ refer to different loci): Multiplicative: $x_{ijk\ldots} = x_i x_j x_k \cdots$; Additive: $x_{ijk\ldots} = \min(1, x_i + x_j + x_k + \cdots)$; Genetic heterogeneity: $x_{ijk\ldots} = 1 - (1-x_i)(1-x_j)(1-x_k)\cdots$
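These constructions translate directly into code. In the minimal sketch below the function names and the parameter name `delta` (the phenocopy increment) are illustrative assumptions, and the penetrance values are made up for the demonstration.

```python
from math import prod


def with_phenocopies(x, delta):
    """Single-locus penetrance plus a phenocopy increment, capped at 1."""
    return min(1.0, x + delta)


def multiplicative(xs):
    """Multi-locus penetrance as the product of per-locus penetrances."""
    return prod(xs)


def additive(xs):
    """Multi-locus penetrance as the sum of per-locus penetrances, capped at 1."""
    return min(1.0, sum(xs))


def heterogeneity(xs):
    """Affected if at least one locus 'acts': 1 - prod(1 - x)."""
    return 1 - prod(1 - x for x in xs)


per_locus = [0.5, 0.2]  # made-up per-locus penetrances
print(multiplicative(per_locus), additive(per_locus), heterogeneity(per_locus))
```

For small penetrances the additive and heterogeneity forms nearly coincide, since the product terms dropped by the additive form are of second order.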
25 Non-genetic cases, more than one gene. How do the foregoing calculations change if there is a proportion of non-genetic cases (so-called phenocopies) of our disease, or if more than one gene influences susceptibility? Risch asserts that phenocopies “do not influence the predictions given above”. Can you see why? Further, he asserts that if we have locus heterogeneity, the same predictions hold, and that this is also true if we have additivity of risks (penetrances). On the other hand, if epistasis (interaction) is present among different loci, e.g. if penetrances are multiplicative, things can be very different. See Risch’s 1990 paper for fuller details on these issues.
26 Familial relative risks and the MFT model We saw a few pages back that $\lambda_{DZ} = K_2 / K^2$, where $K_2$ is the probability of both twins having liabilities exceeding the threshold $T$, a bivariate normal integral, whose correlation is $0.5\,V_A + V_S$. Risch’s Table 1 relates $K$, $V_A = H$, $\lambda_{DZ}$ and $\lambda_{MZ}$ when $V_S = 0$, and draws two important conclusions: For a fixed value of $\lambda$, the heritability $H$ decreases with decreasing $K$; For the MFT model, $R_{MD}$ is always > 2, and increases directly with heritability $H$ and inversely with prevalence $K$.
27 $\lambda_{MZ}$, $\lambda_{DZ}$ and the MFT model with shared environment We are interested in the impact of shared environment on $R_{MD}$ under the MFT model. This isn’t dealt with very neatly by Risch, who says: “For the case where the shared familial environment is equivalent between MZ and DZ twin pairs, $R_{MD}$ will be attenuated: if $R_{MD}$ = 2 without an equivalent shared environment, then $R'_{MD}$ < 2 with it.” To investigate this we recalculate Risch’s Table 1 with a shared environment term in the variance.
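A recalculation of this kind can be sketched as follows. This is an illustration under assumptions rather than a reproduction of Risch's Table 1: the function names are hypothetical, the bivariate normal integral is approximated by Simpson's rule using only the standard library, and the values of K, $V_A$ and $V_S$ are arbitrary.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def prob_both_affected(K, rho, steps=2000):
    """P(both liabilities > T) under BN(0,0; 1,1; rho), |rho| < 1."""
    T = STD_NORMAL.inv_cdf(1 - K)
    s = sqrt(1 - rho * rho)

    def f(x):
        return STD_NORMAL.pdf(x) * (1 - STD_NORMAL.cdf((T - rho * x) / s))

    # Simpson's rule on [T, T + 10]; steps must be even.
    h = 10.0 / steps
    total = f(T) + f(T + 10.0)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(T + i * h)
    return total * h / 3


def mft_risk_ratios(K, v_a, v_s=0.0):
    """Return (lambda_MZ, lambda_DZ, R_MD) under the MFT model."""
    lam_mz = prob_both_affected(K, v_a + v_s) / K ** 2
    lam_dz = prob_both_affected(K, 0.5 * v_a + v_s) / K ** 2
    return lam_mz, lam_dz, (lam_mz - 1) / (lam_dz - 1)


for K in (0.1, 0.01, 0.001):
    for v_s in (0.0, 0.2):
        lam_mz, lam_dz, r_md = mft_risk_ratios(K, 0.4, v_s)
        print(K, v_s, round(lam_mz, 2), round(lam_dz, 2), round(r_md, 2))
```

Setting `v_s=0.0` corresponds to the setting of Risch's Table 1, while a positive `v_s` raises both twin correlations by the same amount and so pulls $R_{MD}$ down.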
34 Evidence of familiality in cancer Risch’s Table 2 contains values of $\lambda_1$, the familial relative risk for 1st degree relatives, at about 25 sites, using data from Utah, and values of $\lambda_O$ and $\lambda_S$ for data from Sweden. In about 10 cases, there is a separate $\lambda_1$ for early onset cancers. He derives 3 observations from these numbers: Apart from a few exceptions, the FRRs are all rather similar, being around 2 in both studies. All the FRRs are > 1, and in the majority of cases, between 1.5 and 3.0. There is no decline in FRR with decreasing frequency of the cancer site; if anything, the reverse. Family recurrence increases with early diagnosis.
35 Evidence from the twin data of Lichtenstein et al (2000) Risch is able to use the Lichtenstein et al data to estimate his $\lambda_{MZ}$ and $\lambda_{DZ}$; see later for how it is done. For the stomach cancer data, he finds: $K$ = 1%, $\lambda_{MZ}$ = 8.49, $\lambda_{DZ}$ = 5.96, and $R_{MD}$ = 1.51 (see next slide). Doing this for a number of sites (results in Table 3), he finds that the values of $\lambda_{MZ}$ and $\lambda_{DZ}$ are reasonably consistent across sites and do not decrease with $K$. Risch goes on to fit a constant risk ratio model to the data, excluding female breast cancer. The results are given in Table 4, and simply formalize the obvious.
36 Twin relative risk calculations Suppose that in a population we have a concordant twin pairs (i.e. both affected with the given cancer), b+c discordant pairs (one affected, one unaffected), and d concordant unaffected twin pairs, n = a+b+c+d. Since $K_2 = 2a/2n$, and $K = (2a+b+c)/2n$, we have $\lambda_R = \text{Twin RR} = K_2 / K^2 = (2a/2n) / [(2a+b+c)/2n]^2 = 4an / (2a+b+c)^2$. (Risch, p. 737, writes c for our a and d for our b+c.) For stomach cancer, male MZ twins have a = 6, b+c = 131 and n = 7,231. We find that $\lambda_{MZ}$ = 8.49.
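The arithmetic for the male MZ stomach cancer pairs can be reproduced in a few lines. The function names below are hypothetical; the counts are those quoted on this slide, and the value 5.96 for the DZ ratio is the one quoted from Risch on the previous slide rather than recomputed, since the DZ total n is not given in these notes.

```python
def twin_risk_ratio(a, discordant, n):
    """Twin relative risk K_2 / K^2 = 4an / (2a + b + c)^2."""
    return 4 * a * n / (2 * a + discordant) ** 2


def prevalence(a, discordant, n):
    """Prevalence K = (2a + b + c) / 2n among individual twins."""
    return (2 * a + discordant) / (2 * n)


lam_mz = twin_risk_ratio(6, 131, 7231)
print("K         =", round(prevalence(6, 131, 7231), 4))
print("lambda_MZ =", round(lam_mz, 2))

lam_dz = 5.96  # quoted value, not recomputed here
print("R_MD      =", round((lam_mz - 1) / (lam_dz - 1), 2))
```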
37 Risch’s conclusions from his reanalysis of data from Lichtenstein et al. “… the conclusion that rarer cancers are less heritable is strictly a consequence of the MFT model and is not robust to violations from that model.” “Thus the observed value of $R_{MD}$ conforms poorly to the predictions of the MFT model but extremely well to the single locus or additive genetic model.” “Hence the conclusion of a significant shared twin environmental component may simply be a consequence of using the wrong genetic model.” (There are some important comments about age-structure which we’ll pass over for the moment.)
38 Heritability versus Attributable Risk. Heritability is used as a measure of the importance of genetic effects in the context of the MFT model. Small values lead to the conclusion that genetic effects are minor relative to environmental impacts. But what if the MFT model does not apply? Epidemiologists use relative risk RR (risk to exposed versus unexposed individuals) and PAF (proportion of disease prevented by elimination from the population of the risk factor). Table 5 presents the values of $RR_{Het}$ and PAF for 2 values of $\lambda_{MZ}$ and $\lambda_{DZ}$ and allele frequencies ranging from 0.001 to 0.10. It shows that the PAF can range from small values to 100%, depending on disease allele frequency. Similarly, quite high values of $RR_{Het}$ can arise.
39 Conclusions In my view, Risch has demonstrated that his conclusions are justified, in particular, that the MFT model is not a helpful framework within which to assess the contribution of heredity to cancer. For more on FRRs, see: N. Risch, Linkage strategies for genetically complex traits. I. Multi-locus models. Am. J. Hum. Genet. 46: 222-228, 1990.
40 Acknowledgement Many thanks to Ingileif Hallgrímsdóttir for helping out with class this week.