# 1 Statistical Considerations for Population-Based Studies in Cancer I Special Topic: Statistical analyses of twin and family data Kim-Anh Do, Ph.D. Associate.

1 Statistical Considerations for Population-Based Studies in Cancer I Special Topic: Statistical analyses of twin and family data Kim-Anh Do, Ph.D. Associate Professor Department of Biostatistics Email: kim@mdanderson.org http://odin.mdacc.tmc.edu/~kim

2 The usual idea of a gene is a specific region of DNA that codes for a single protein or enzyme, and the position of a gene on a chromosome is its locus. The basis for research by human geneticists is to try to identify traits, or phenotypes, whose inheritance patterns are consistent with the action of individual genes. Recent advances in genetics show that the relationship between DNA sequence and phenotype is both more complex and more interesting than we thought. Some functions of DNA do not even depend on its nucleotide sequence, and DNA sequence variation includes a variety of direct and indirect forms of feedback among various regions of the DNA within and between cells.

3 Allele and genotype frequencies. The most fundamental quantitative variable in population genetics is the allele frequency, a prevalence measure. When a locus has only two alleles, denote their frequencies $p$ and $q = 1 - p$. Let $P_g$ denote the frequency of genotype $g$. The frequency of an $ii$ homozygote is $P_{ii} = p_i \times p_i = p_i^2$. The frequency of an $ik$ heterozygote is $P_{ik} = 2 p_i p_k$. For a diallelic system the genotypes have frequencies $P_{AA} = p^2$, $P_{Aa} = 2pq$, $P_{aa} = q^2$.
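
The genotype frequencies on this slide can be checked numerically. The following is a minimal sketch, assuming random mating at a diallelic locus; the function name `genotype_frequencies` and the allele frequency 0.3 are illustrative choices, not part of the original slides.

```python
def genotype_frequencies(p):
    """Genotype frequencies at a diallelic locus, given frequency p of allele A.

    Assumes random mating, so that P_AA = p^2, P_Aa = 2pq, P_aa = q^2.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


# Illustrative allele frequency (hypothetical value).
freqs = genotype_frequencies(0.3)
for genotype, frequency in freqs.items():
    print(genotype, round(frequency, 4))
```

The three frequencies always sum to one because $(p + q)^2 = 1$.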

4 Frequency relationships between genotype and phenotype: the concept of penetrance. A given genotype does not always produce the same phenotype. The association between the two is known as the penetrance. Individuals with a given genotype will have some distribution of phenotypes; the penetrance function specifies the probability that an individual with genotype $g$ has phenotype $\phi$: $\Phi_g(\phi) = \Pr(\phi \mid g)$.

5 Frequency relationships between genotype and phenotype (cont’d). For many quantitative biological traits there is some measurement scale on which the phenotypes are approximately normally distributed: $\Phi_g(\phi) = \frac{1}{\sigma_g \sqrt{2\pi}} \exp\left[-\frac{(\phi - \mu_g)^2}{2\sigma_g^2}\right]$, where $\mu_g$ and $\sigma_g$ are the mean and standard deviation of the phenotype among individuals with genotype $g$. Penetrance is a statistical, population-specific association between genotype and phenotype, not a biological explanation of such a relationship. Many factors may affect the expression of a given genotype: other genes, environmental factors, errors in measurement or classification, sampling error, etc.
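
A normal penetrance function of this kind can be evaluated directly. The sketch below is illustrative only: the function name `normal_penetrance` and the genotype-specific means and standard deviations in `PARAMS` are hypothetical values, not taken from the slides.

```python
import math


def normal_penetrance(phi, mu_g, sigma_g):
    """Normal density of phenotype phi for a genotype with mean mu_g and SD sigma_g."""
    coeff = 1.0 / (sigma_g * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((phi - mu_g) ** 2) / (2.0 * sigma_g ** 2))


# Hypothetical genotype-specific means and standard deviations.
PARAMS = {"AA": (10.0, 1.5), "Aa": (8.0, 1.5), "aa": (2.0, 1.5)}

# Density of observing phenotype 7.0 under each genotype.
densities = {g: normal_penetrance(7.0, mu, sd) for g, (mu, sd) in PARAMS.items()}
for genotype, density in densities.items():
    print(genotype, round(density, 5))
```

With these made-up parameters, a phenotype of 7.0 is most compatible with the Aa genotype, whose mean lies closest to it.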

6 Nuclear families and sibships: the distribution of traits in families. A diploid, sexually reproducing organism has two sets of genes, one inherited from each parent. Each time that individual produces a gamete (sperm or egg), one of his/her two inherited alleles at each locus is randomly chosen and transmitted in the gamete. There is thus a probability of ½ that an offspring will inherit a specific parental allele. This probabilistic aspect of inheritance is a fundamental aspect of our biology.

7 Segregation analysis: discrete traits in families. We can understand the basic principles of genetic epidemiology by studying the behavior of alleles at a single locus in nuclear families. We can take advantage of evolution-based constraints on the distribution of genetic variation in families. The analysis of trait distributions in families is known as segregation analysis after Gregor Mendel’s Law of Segregation of individual alleles at a locus. The idea is to judge if the pattern of phenotypes in families is consistent with a genetic model. Families are ascertained via one or more index individuals, or probands, who may be either randomly identified, or chosen because of their disease or other phenotype status.

9 Nuclear families and sibships (cont’d). Transmission probabilities: for a single diallelic locus with alleles A and a, define the transmission probability $t(x \mid g)$ as the probability that a parent of genotype $g$ produces a gamete carrying allele $x$. These are conditional probabilities because they depend on the genotypic state of the parent. For autosomal loci, $t(A \mid AA) = 1$, $t(A \mid Aa) = \tfrac{1}{2}$, $t(A \mid aa) = 0$.

10 Table 5.1A. Genotypic mating table for an autosomal diallelic locus

11 Nuclear families and sibships (cont’d). Mating types: the probability that an individual has a given genotype is determined by the genotypes, or mating type, of its parents. A nuclear family is a set of repeated selections of offspring genotypes from the mating type, $M_{kl}$, of parents with genotypes $k$ and $l$. In a population (or sample), $0 \le \Pr(M_{kl}) \le 1$ and $\sum_k \sum_l \Pr(M_{kl}) = 1$; the frequency is estimated by $\widehat{\Pr}(M_{kl}) = n_{kl}/N$, the observed number of $kl$ matings divided by the total number of matings. If there is random mating relative to the locus in question, the mating type frequencies are determined by the genotype frequencies (which are in turn determined by the allele frequencies).

12 Nuclear families and sibships (cont’d). Transition probabilities: family data consist of parent-offspring triads. Define transition probabilities $P(g_o \mid g_f, g_m)$ as the conditional probabilities of genotypes in offspring given those in the father and mother. For a diallelic locus, there are three possible offspring genotypes (AA, Aa, aa), with transition probabilities $P(AA \mid g_f, g_m) = t(A \mid f)\, t(A \mid m)$; $P(Aa \mid g_f, g_m) = t(A \mid f)\,[1 - t(A \mid m)] + t(A \mid m)\,[1 - t(A \mid f)]$; $P(aa \mid g_f, g_m) = [1 - t(A \mid m)]\,[1 - t(A \mid f)]$. See Table 5.1B.
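
The transition probabilities can be generated from the transmission probabilities for any parental mating type. The following is a small sketch using exact fractions; the names `T_A` and `transition_probabilities` are illustrative and not from the slides.

```python
from fractions import Fraction

# Probability that a parent of each genotype transmits allele A (autosomal locus).
T_A = {"AA": Fraction(1), "Aa": Fraction(1, 2), "aa": Fraction(0)}


def transition_probabilities(father, mother):
    """Offspring genotype probabilities given the genotypes of father and mother."""
    tf, tm = T_A[father], T_A[mother]
    return {
        "AA": tf * tm,
        "Aa": tf * (1 - tm) + tm * (1 - tf),
        "aa": (1 - tf) * (1 - tm),
    }


# Print the offspring distribution for all nine ordered mating types.
for father in T_A:
    for mother in T_A:
        probs = transition_probabilities(father, mother)
        print(father, "x", mother, {g: str(pr) for g, pr in probs.items()})
```

For an Aa x Aa mating this gives the familiar 1/4, 1/2, 1/4 split among AA, Aa, and aa offspring.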

13 Table 5.1B. Parent to offspring transition probabilities for a diallelic locus

14 Table 5.2. Phenotypic mating table for an autosomal diallelic locus

15 Segregation analysis: discrete traits in families (cont’d). Ascertainment bias and correction: sibship data. The way in which families are ascertained can have a major effect on the interpretation we make of the data. Example: ascertain affected children through the school system, then collect data on all siblings of the affected children. Suppose the segregation proportion (also the probability that a random offspring is affected) is $\theta$. The probability that a family of sibship size $s$ produces $r$ affected children follows a binomial distribution, $\Pr(r \mid s, \theta) = \frac{s!}{r!\,(s-r)!}\, \theta^r (1-\theta)^{s-r}$. Therefore the probability that such a family will produce $s$ normal children is $(1-\theta)^s$. These families will never be identified if we ascertain sibships through affected school children.

16 Ascertainment bias and correction: sibship data (cont’d). We must correct for ascertainment to obtain unbiased estimates. One simple way is to recognize that our sample contains all families except those with no affected children, i.e. our sample represents a fraction $1 - (1-\theta)^s$ of the total population of sibships in this example. The corrected probability of $r$ affected children in a family of size $s$ is $\Pr(r \mid s, \theta, r \ge 1) = \frac{s!}{r!\,(s-r)!}\, \theta^r (1-\theta)^{s-r} \big/ \left[1 - (1-\theta)^s\right]$. Another way of correcting for ascertainment is to perform analyses ignoring the affected probands. This is acceptable only if the probability that a given affected child is ascertained is small. Other ascertainment problems: families with many affected members may have a higher chance of being ascertained by a given sampling scheme. Corrections for some simple sampling situations have long been known in medical genetics, but methods for complex situations are still inexact.
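
The effect of this correction can be seen numerically. The sketch below assumes that every sibship with at least one affected child is ascertained, as in the school example; the parameter name `theta` for the segregation proportion, the function names, and the values 0.25 and 4 are illustrative assumptions.

```python
from math import comb


def binomial_prob(r, s, theta):
    """Uncorrected probability of r affected children in a sibship of size s."""
    return comb(s, r) * theta ** r * (1.0 - theta) ** (s - r)


def ascertained_prob(r, s, theta):
    """Probability of r affected, given that the sibship has at least one affected child."""
    if r < 1:
        return 0.0  # sibships with no affected children are never sampled
    return binomial_prob(r, s, theta) / (1.0 - (1.0 - theta) ** s)


# Hypothetical example: recessive-like segregation proportion, sibships of size 4.
theta, s = 0.25, 4
corrected = [ascertained_prob(r, s, theta) for r in range(s + 1)]
naive_mean = s * theta
corrected_mean = sum(r * pr for r, pr in enumerate(corrected))
print("expected affected, all sibships:", round(naive_mean, 3))
print("expected affected, ascertained sibships:", round(corrected_mean, 3))
```

The expected number of affected children is higher among ascertained sibships than in the population of all sibships, which is why an uncorrected analysis overestimates the segregation proportion.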

17 Segregation analysis: quantitative traits in families. Quantitative traits may be affected by a large number of loci acting together, as well as by environmental factors. Examples of important disease-related traits: blood pressure, obesity measures, cholesterol, triglycerides. We need to understand the effect of the genotypes, and of the environment, on the phenotype. The effects of genotypes on quantitative phenotypes are relative: does genotype AA increase the phenotype, or does aa decrease it?

18 Segregation analysis: quantitative traits in families. The simplest measure of genetic effect is the genotypic value, the mean phenotype observed among individuals with a given genotype in the reference population: $\mu_g = \sum_i \phi_i\, \Phi_g(\phi_i)$. The mean number of doses of a given allele, say A, in genotypes in a population is $2p^2 + 2pq\,(1) + q^2\,(0) = 2p$. The mean phenotype in the population is the weighted average $\mu = \sum_g P_g\, \mu_g = p^2 \mu_{AA} + 2pq\, \mu_{Aa} + q^2 \mu_{aa}$ for a diallelic locus.

19 Genetic variation for a quantitative trait. The genotypic variance is defined as the variance among the genotypic values in the population: $\sigma_g^2 = \sum_g P_g (\mu_g - \mu)^2 = \sum_g P_g\, \mu_g^2 - \mu^2$, which equals $2pq$ when the genotypic values are the allele doses 2, 1, and 0. It is often convenient to express genotypic values as deviations from the population mean, denoted by $\gamma_g = \mu_g - \mu$. In the simplest situation, the effects of the individual alleles are additive, and the genotypic value is the sum of the effects of the two alleles in the genotype.
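
The population mean and the genotypic variance can be computed directly from the allele frequency and the three genotypic values. The following is a minimal sketch, assuming random-mating genotype frequencies; the function names and the allele frequency 0.3 are illustrative.

```python
def genotype_freqs(p):
    """Random-mating genotype frequencies for allele A at frequency p."""
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def population_mean(p, values):
    """Frequency-weighted average of the genotypic values."""
    freqs = genotype_freqs(p)
    return sum(freqs[g] * values[g] for g in freqs)


def genotypic_variance(p, values):
    """Frequency-weighted variance of the genotypic values about the population mean."""
    freqs = genotype_freqs(p)
    mean = population_mean(p, values)
    return sum(freqs[g] * (values[g] - mean) ** 2 for g in freqs)


# Using allele doses as genotypic values reproduces mean 2p and variance 2pq.
p = 0.3
doses = {"AA": 2.0, "Aa": 1.0, "aa": 0.0}
dose_mean = population_mean(p, doses)
dose_var = genotypic_variance(p, doses)
print(round(dose_mean, 4), round(dose_var, 4))
```

Replacing `doses` with measured genotypic means gives the corresponding mean and variance for a real trait.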

20 Genetic variation for a quantitative trait (cont’d). Define $\alpha_i$ to be the allelic value that allele $i$ contributes to the genotype. Since allele A is paired with another A a fraction $p$ of the time, and with a a fraction $q$ of the time, we have $\alpha_A = p\,\gamma_{AA} + q\,\gamma_{Aa}$ and $\alpha_a = p\,\gamma_{Aa} + q\,\gamma_{aa}$. A special characteristic of effects expressed as deviations is that their average over all genotypes must be zero, i.e. $p\,\alpha_A + q\,\alpha_a = 0$. When the allelic effects are additive, the breeding value, or average deviation, of genotype $ik$ is $\alpha_i + \alpha_k$.

21 Genetic variation for a quantitative trait (cont’d). Define the additive genotypic variance, $\sigma_A^2$, as the sum of the squared breeding values, weighted by the genotype frequencies: $\sigma_A^2 = p^2 (2\alpha_A)^2 + 2pq\,(\alpha_A + \alpha_a)^2 + q^2 (2\alpha_a)^2 = 2\,(p\,\alpha_A^2 + q\,\alpha_a^2)$. Define the dominance displacement $d$ as the position of the heterozygote relative to the two homozygotes: $d = (\mu_{Aa} - \mu_{aa}) / (\mu_{AA} - \mu_{aa})$. If the effects are purely additive, the heterozygote genotypic value will be exactly halfway between those of the homozygotes, i.e. $d = 1/2$. The dominance variance is the variance due to dominance deviations from additivity and equals $\sigma_D^2 = p^2 (\gamma_{AA} - 2\alpha_A)^2 + 2pq\,(\gamma_{Aa} - \alpha_A - \alpha_a)^2 + q^2 (\gamma_{aa} - 2\alpha_a)^2$.
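
The additive and dominance variances can be computed from an allele frequency and three genotypic values, which also lets one check that they add up to the total genotypic variance at a single locus. This is a sketch under random-mating genotype frequencies; the function name, the variable names (`gamma` for genotypic deviations, `alpha` for allelic values), and the numerical inputs are illustrative assumptions.

```python
def variance_components(p, mu_AA, mu_Aa, mu_aa):
    """Additive, dominance and total genotypic variance at a diallelic locus."""
    q = 1.0 - p
    mean = p * p * mu_AA + 2.0 * p * q * mu_Aa + q * q * mu_aa

    # Genotypic values expressed as deviations from the population mean.
    gamma_AA, gamma_Aa, gamma_aa = mu_AA - mean, mu_Aa - mean, mu_aa - mean

    # Allelic values: average deviation of genotypes carrying each allele.
    alpha_A = p * gamma_AA + q * gamma_Aa
    alpha_a = p * gamma_Aa + q * gamma_aa

    var_g = p * p * gamma_AA ** 2 + 2.0 * p * q * gamma_Aa ** 2 + q * q * gamma_aa ** 2
    var_A = 2.0 * (p * alpha_A ** 2 + q * alpha_a ** 2)
    var_D = (
        p * p * (gamma_AA - 2.0 * alpha_A) ** 2
        + 2.0 * p * q * (gamma_Aa - alpha_A - alpha_a) ** 2
        + q * q * (gamma_aa - 2.0 * alpha_a) ** 2
    )
    d = (mu_Aa - mu_aa) / (mu_AA - mu_aa)  # dominance displacement
    return {"alpha_A": alpha_A, "alpha_a": alpha_a, "var_g": var_g,
            "var_A": var_A, "var_D": var_D, "d": d}


# Hypothetical genotypic values with partial dominance of A.
dominant = variance_components(0.3, 10.0, 8.0, 2.0)
# Hypothetical purely additive case: heterozygote exactly halfway (d = 1/2).
additive = variance_components(0.3, 10.0, 6.0, 2.0)
print({k: round(v, 4) for k, v in dominant.items()})
print({k: round(v, 4) for k, v in additive.items()})
```

In the purely additive case the dominance variance vanishes and all genotypic variance is additive.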

22 Environmental effects on quantitative phenotypes. Environmental factors are responsible for within-genotype variance. The simplest way to account for environmental variance is to aggregate all unmeasured effects on the phenotype, usually assuming that they have a normal distribution. We can now express the determination of the phenotype as a sum of additive genetic, dominance, and environmental effects, $\phi = A + D + E$, with variance $\sigma_\phi^2 = \sigma_A^2 + \sigma_D^2 + \sigma_E^2$. The environmental effects can be additive, i.e. act similarly on each genotype, or there can be a genotype by environment ($G \times E$) interaction if the same environmental exposure affects different genotypes differently (add $\sigma_{GE}^2$ to the above equation).

23 Kinship and inbreeding coefficients: probabilities of shared genes. Several quantities are used to measure the genetic relationship between two individuals. The coefficient of kinship, $F_{XY}$, between individuals X and Y, is the probability that two alleles at the same locus, one chosen randomly from each individual, are identical by descent (ibd) from some common ancestor. The inbreeding coefficient, $F$, of an individual is the probability that his/her two alleles at a locus are ibd; this equals the kinship coefficient of his/her parents. The coefficient of relationship, $r = 2F_{XY}$, is the fraction of genes shared ibd by two individuals. Table 6.2 gives kinship coefficients $F$ for various important kinds of relative pair.
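
Kinship coefficients for a known pedigree can be obtained with the standard recursive rule: the kinship of an individual with a relative is the average of his/her parents' kinships with that relative, and the kinship of an individual with himself/herself is one half of one plus the kinship of his/her parents. The sketch below applies this to a small hypothetical pedigree; the pedigree, the identifiers, and the function name are illustrative, and founders are assumed to be unrelated and non-inbred.

```python
from functools import lru_cache

# Hypothetical pedigree: person -> (father, mother); founders have no recorded parents.
# Parents must be listed before their children.
PEDIGREE = {
    "GF": (None, None),
    "GM": (None, None),
    "P1": ("GF", "GM"),
    "P2": ("GF", "GM"),  # full sib of P1
    "S1": (None, None),  # unrelated spouse of P1
    "S2": (None, None),  # unrelated spouse of P2
    "C1": ("P1", "S1"),
    "C2": ("P2", "S2"),  # first cousin of C1
}
ORDER = {name: i for i, name in enumerate(PEDIGREE)}


@lru_cache(maxsize=None)
def kinship(x, y):
    """Kinship coefficient F_XY, computed by recursion up the pedigree."""
    if x is None or y is None:
        return 0.0  # an unrecorded parent is treated as unrelated to everyone
    if ORDER[x] < ORDER[y]:
        x, y = y, x  # recurse through the later-listed person, who cannot be an ancestor of y
    father, mother = PEDIGREE[x]
    if x == y:
        return 0.5 * (1.0 + kinship(father, mother))
    return 0.5 * (kinship(father, y) + kinship(mother, y))


for pair in [("P1", "C1"), ("P1", "P2"), ("C1", "C2")]:
    f = kinship(*pair)
    print(pair, "kinship =", f, "relationship r =", 2 * f)
```

For this pedigree the recursion returns 1/4 for parent and child, 1/4 for full sibs, and 1/16 for first cousins.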

24 Table 6.2 (Weiss) Genetic relationships among various types of relative

25 Genotypic correlation between relatives. Consider the genotypic values of parents and offspring for an additive diallelic locus (see Table 6.3). For a locus with three genotypes, there are nine possible parent-offspring genotype pairs. Example (first row of the table): given an AA father, the probabilities of an AA, Aa, or aa child are $p$, $1-p$, and 0 respectively. Every offspring receives an A from the father with probability 1, so no offspring can have genotype aa. The mother contributes an A with probability $p$ (making the offspring AA) or an a with probability $1-p$ (making the offspring Aa).

26 Table 6.3. Parent-offspring relationships

27 Table 6.4 (Weiss) Components of genetic covariance for various types of relative

28 The covariance between any pair of relatives, P and Q, can be expressed as a weighted combination of the additive and dominance variances. Let the parents of P be denoted by A and B, and the parents of Q by C and D. Then $\mathrm{Cov}(P, Q) = r_{PQ}\, \sigma_A^2 + u_{PQ}\, \sigma_D^2$, where $u_{PQ} = F_{AC} F_{BD} + F_{AD} F_{BC}$ and the $F$ values are the kinship coefficients given in Table 6.2.
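
As a worked illustration of this formula, the weights for full sibs and for parent and offspring can be computed from kinship coefficients. The sketch assumes non-inbred individuals with unrelated parents and uses the standard kinship values for that case (1/2 for an individual with himself/herself, 1/4 for parent and offspring or full sibs, 0 for unrelated pairs); the function name and the variance values are hypothetical.

```python
def relative_covariance(f_pq, f_ac, f_bd, f_ad, f_bc, var_a, var_d):
    """Cov(P, Q) = r_PQ * var_a + u_PQ * var_d, with r = 2 F_PQ and u = F_AC F_BD + F_AD F_BC."""
    r_pq = 2.0 * f_pq
    u_pq = f_ac * f_bd + f_ad * f_bc
    return r_pq * var_a + u_pq * var_d


# Hypothetical variance components.
VAR_A, VAR_D = 9.0, 1.0

# Full sibs: same parents, so A = C and B = D (F_AC = F_BD = 1/2, F_AD = F_BC = 0).
full_sibs = relative_covariance(0.25, 0.5, 0.5, 0.0, 0.0, VAR_A, VAR_D)

# Parent P and offspring Q: C is P itself, D is P's unrelated spouse,
# so F_AC = F_BC = 1/4 (P's parents with P) and F_AD = F_BD = 0.
parent_offspring = relative_covariance(0.25, 0.25, 0.0, 0.0, 0.25, VAR_A, VAR_D)

print("full sibs:", full_sibs)
print("parent-offspring:", parent_offspring)
```

Under these assumptions full sibs share half of the additive variance and a quarter of the dominance variance, while parent and offspring share half of the additive variance and none of the dominance variance.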

29 Extension to multiple loci: polygenic traits. Fisher (1918) showed that the single-locus genetic relationships among relatives are preserved for multiple additive loci. Example: at a single locus, there are three genotypes (AA, Aa, aa) and three genotypic dose values (0, 1, and 2). At two such loci, there are nine genotypes (aabb, aabB, aaBB, aAbb, aAbB, aABB, AAbb, AAbB, AABB) and five different genotypic values (0, 1, 2, 3, 4). In general, for $n$ such loci there are $3^n$ genotypes and $2n+1$ genotypic values, i.e. as $n$ gets large, the distribution of additive genotypic values resembles the continuous distribution of a quantitative trait. In practice, the distribution of summed additive effects can be approximated by a normal distribution. The genotypic correlations between relatives also hold for multiple additive loci.
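
The counts in this example can be reproduced by brute-force enumeration. The following sketch lists every multi-locus genotype as a tuple of per-locus doses and tallies the total dose; it counts genotypes only and does not weight them by allele frequency. The function name `dose_distribution` is an illustrative choice.

```python
from collections import Counter
from itertools import product


def dose_distribution(n_loci):
    """Number of multi-locus genotypes at each total dose, for n diallelic additive loci."""
    # Each locus contributes a dose of 0, 1 or 2 (aa, Aa, AA).
    counts = Counter(sum(doses) for doses in product((0, 1, 2), repeat=n_loci))
    return dict(sorted(counts.items()))


for n in (1, 2, 3):
    dist = dose_distribution(n)
    print(n, "loci:", sum(dist.values()), "genotypes,", len(dist), "values", dist)
```

The number of distinct values grows only linearly in $n$ while the number of genotypes grows exponentially, so many genotypes share the same genotypic value.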

30 Extension to multiple loci: polygenic traits (cont’d). Dominance refers to non-additive (interaction) effects between alleles at the same locus. Epistasis refers to interactions among alleles at different loci. This adds another term to the expression for the determination of the phenotype, $\phi = PG + E = A + D + I + E$, with variance $\sigma_\phi^2 = \sigma_{PG}^2 + \sigma_E^2 = \sigma_A^2 + \sigma_D^2 + \sigma_I^2 + \sigma_E^2$, which can be rewritten as $1 = \sigma_{PG}^2 / \sigma_\phi^2 + \sigma_E^2 / \sigma_\phi^2$. Define heritability as $h^2 = \sigma_{PG}^2 / \sigma_\phi^2$. Heritability represents the ratio of the observed phenotypic correlation to the theoretical genotypic correlation. In twins: $h^2 = (\sigma_{DZ}^2 - \sigma_{MZ}^2) / \sigma_{DZ}^2$.
