BMI 731 - Winter 2005, Chapter 1: SNP Analysis. Catalin Barbacioru, Department of Biomedical Informatics, Ohio State University
Biological Background Cells are the fundamental working units of every living system. The nucleus contains a large DNA (deoxyribonucleic acid) molecule, which carries the genetic instructions. A DNA molecule consists of two strands that wrap around each other to resemble a twisted ladder. Each strand is a chain of nucleotides, and each nucleotide is composed of one sugar molecule, one phosphate molecule, and a base. Four different bases are present in DNA: adenine (A), thymine (T), cytosine (C), and guanine (G). The particular order of the bases arranged along the sugar-phosphate backbone is called the DNA sequence.
The two strands of the DNA molecule are held together at their bases by weak hydrogen bonds. The four bases pair in a set manner: adenine (A) pairs with thymine (T), while cytosine (C) pairs with guanine (G). These pairs of bases are known as base pairs (bp). The DNA is organized into separate long segments called chromosomes, and the number of chromosomes differs across organisms (humans have 46, i.e. 23 pairs, with each parent contributing 23 chromosomes).
Glossary Allele = Alternative form of a gene. One of the different forms of a gene that can exist at a single locus. Genotype = The specific allelic composition of a cell, either of the entire cell or more commonly for a certain gene or a set of genes. Haplotype = A set of closely linked genetic markers present on one chromosome which tend to be inherited together (not easily separable by recombination).
Glossary Locus: A point in the genome, identified by a marker, which can be mapped by some means. Marker: Also known as a genetic marker, a segment of DNA with an identifiable physical location on a chromosome whose inheritance can be followed. A marker can be a gene, or it can be some section of DNA with no known function. Mutation: A permanent structural alteration in DNA.
Glossary Hardy-Weinberg equilibrium = The stable frequency distribution of genotypes, AA, Aa, and aa, in the proportions p^2, 2pq, and q^2 respectively (where p and q are the frequencies of the alleles, A and a) that is a consequence of random mating in the absence of mutation, migration, natural selection, or random drift. Linkage disequilibrium = When the observed frequencies of haplotypes in a population do not agree with the haplotype frequencies predicted by multiplying together the frequencies of the individual genetic markers in each haplotype.
A Little Population Genetics Population genetics (and evolutionary genetics) deal with groups of organisms and families, usually natural populations. We can discern two strands of thought in the area. One is the study of very large, idealized ("ideal") populations, where models can be deterministic. The other deals with smaller populations, where chance plays a larger role (so-called genetic drift).
Genotype and allele frequencies One question of crucial interest is this: how common are the different alleles at a given locus in a given population? The percentages are our best estimate of the probability that an individual will carry that genotype in the population of London, Oxford and Cambridge. The observed heterozygosity is 49.6%.
There is another population described in this table. It is the population of gametes that gave rise to individuals tested: The percentages here are our best estimate of the probability that a sperm or egg taken from that population will carry that particular allele. If the frequency of the commonest allele at a particular locus is less than 99%, we call this a polymorphic locus or polymorphism.
Hardy-Weinberg equilibrium Hardy-Weinberg equilibrium describes the relationship between the gametic (allele) frequencies and the resulting genotypic frequencies. It holds if the following properties are true for the given locus:
1. Random mating or panmixia: the choice of a mate is not influenced by his/her genotype at the locus.
2. The locus does not affect the chance of mating at all, either by altering fertility or by decreasing survival to reproductive age.
If these properties hold, then the probability that two gametes will meet and give rise to a new genotype is simply the product of the allele frequencies (as in a binomial expansion): $P(AA) = P(A) \times P(A) = p_A^2$, $P(aa) = P(a) \times P(a) = p_a^2$, and $P(Aa) = 1 - P(AA) - P(aa) = 2 \times P(A) \times P(a) = 2 p_A p_a$.
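As a minimal sketch of the product rule above, the following computes the expected genotype proportions from a single allele frequency. The function name `hardy_weinberg_proportions` is hypothetical and chosen only for illustration.

```python
def hardy_weinberg_proportions(p_A):
    """Expected genotype frequencies at a biallelic locus under HWE.

    p_A is the frequency of allele A; the frequency of allele a is 1 - p_A.
    """
    if not 0.0 <= p_A <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p_a = 1.0 - p_A
    return {
        "AA": p_A ** 2,         # P(A) x P(A)
        "Aa": 2.0 * p_A * p_a,  # two ways to form a heterozygote
        "aa": p_a ** 2,         # P(a) x P(a)
    }


if __name__ == "__main__":
    print(hardy_weinberg_proportions(0.3))
```

The three proportions always sum to 1, which is the identity $(p_A + p_a)^2 = 1$.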
Tests for HWE For a two-allele case, the disequilibrium coefficient is $D = P_{AA} - p_A^2$, where $P_{AA} = P(AA)$ is the probability of the AA genotype and $p_A = P(A)$ is the probability of allele A. If $n_{AA}$, $n_{Aa}$, $n_{aa}$ are the numbers of individuals with genotypes AA, Aa and aa respectively, out of a total of $n = n_{AA} + n_{Aa} + n_{aa}$ individuals, then estimators of the above probabilities are $P_{AA} = n_{AA}/n$, $P_{Aa} = n_{Aa}/n$, $P_{aa} = n_{aa}/n$, $p_A = (2n_{AA} + n_{Aa})/(2n)$ and $p_a = (2n_{aa} + n_{Aa})/(2n)$, with $p_a + p_A = 1$.
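The estimators on this slide can be sketched in a few lines. The function name `hwe_estimates` and the dictionary keys are hypothetical; the counts in the usage line are the first row of the example table later in these slides (9 AA, 1 Aa, 30 aa).

```python
def hwe_estimates(n_AA, n_Aa, n_aa):
    """Estimate allele frequencies and the disequilibrium coefficient D
    from genotype counts at a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    if n == 0:
        raise ValueError("at least one individual is required")
    P_AA = n_AA / n                     # observed AA genotype frequency
    p_A = (2 * n_AA + n_Aa) / (2 * n)   # each individual carries two alleles
    p_a = (2 * n_aa + n_Aa) / (2 * n)
    D = P_AA - p_A ** 2                 # departure from the HWE proportion
    return {"p_A": p_A, "p_a": p_a, "D": D}


if __name__ == "__main__":
    print(hwe_estimates(9, 1, 30))
```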
Chi-square test for HWE Then under HWE:
Genotype    AA           Aa               aa
Observed    $n_{AA}$     $n_{Aa}$         $n_{aa}$
Expected    $n p_A^2$    $2n p_A p_a$     $n p_a^2$
Obs-Exp     $nD$         $-2nD$           $nD$
Chi-square test for HWE The goodness-of-fit chi-squared statistic is
$$X_A^2 = \sum_{\text{genotypes}} \frac{(\text{Obs}-\text{Exp})^2}{\text{Exp}} = \frac{(nD)^2}{n p_A^2} + \frac{(-2nD)^2}{2n p_A p_a} + \frac{(nD)^2}{n p_a^2} = \frac{n D^2}{p_A^2 (1-p_A)^2},$$
and the test rejects the null hypothesis $H_0$ of HWE if $X_A^2 > 3.84$. The usual problem with this test is that it is sensitive to small expected values. An alternative version with a continuity correction (Yates) is
$$X_A^2 = \sum_{\text{genotypes}} \frac{(|\text{Obs}-\text{Exp}| - 0.5)^2}{\text{Exp}}.$$
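A minimal sketch of the goodness-of-fit test, with the Yates correction as an option. The names `hwe_chi_square` and `CRITICAL_VALUE_5PCT_1DF` are hypothetical; 3.84 is the 5% critical value quoted on the slide.

```python
CRITICAL_VALUE_5PCT_1DF = 3.84  # chi-square, 1 d.f., 5% significance level


def hwe_chi_square(n_AA, n_Aa, n_aa, yates=False):
    """Goodness-of-fit chi-square statistic for HWE at a biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    p_a = 1.0 - p_A
    if p_A == 0.0 or p_a == 0.0:
        raise ValueError("locus is monomorphic; expected counts would be zero")
    observed = (n_AA, n_Aa, n_aa)
    expected = (n * p_A ** 2, 2 * n * p_A * p_a, n * p_a ** 2)
    correction = 0.5 if yates else 0.0  # Yates continuity correction
    return sum((abs(o - e) - correction) ** 2 / e
               for o, e in zip(observed, expected))


if __name__ == "__main__":
    for counts in [(9, 1, 30), (2, 15, 23)]:
        x2 = hwe_chi_square(*counts)
        print(counts, round(x2, 2), "reject" if x2 > CRITICAL_VALUE_5PCT_1DF else "keep")
```

Without the correction, the sum over genotypes agrees with the closed form $nD^2 / (p_A^2 (1-p_A)^2)$.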
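Fisher (exact) test for HWE Under the HWE hypothesis, the probability of the observed set of genotypic counts $n_{AA}$, $n_{Aa}$ and $n_{aa}$ in a sample of size $n$ is multinomial:
$$\Pr(n_{AA}, n_{Aa}, n_{aa}) = \frac{n!}{n_{AA}!\, n_{Aa}!\, n_{aa}!}\, (p_A^2)^{n_{AA}} (2 p_A p_a)^{n_{Aa}} (p_a^2)^{n_{aa}},$$
whereas the allele counts $n_A = 2n_{AA} + n_{Aa}$ and $n_a = 2n_{aa} + n_{Aa}$ are binomially distributed if HWE holds:
$$\Pr(n_A, n_a) = \frac{(2n)!}{n_A!\, n_a!}\, p_A^{n_A} p_a^{n_a}.$$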
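Fisher (exact) test for HWE Putting these together, the probability of the observed genotypic counts, assuming HWE and conditional on the observed allele counts $n_A = 2n_{AA} + n_{Aa}$ and $n_a = 2n_{aa} + n_{Aa}$, is
$$\Pr(n_{AA}, n_{Aa}, n_{aa} \mid n_A, n_a) = \frac{n!\, n_A!\, n_a!\, 2^{n_{Aa}}}{n_{AA}!\, n_{Aa}!\, n_{aa}!\, (2n)!},$$
which can be expressed in terms of the number $n_A$ of A alleles and the number of heterozygotes $n_{Aa}$. We reject the HWE hypothesis if the total conditional probability of the observed outcome and of all outcomes that are no more probable is less than the significance level (the type I error rate α), usually 0.05.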
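The exact test can be sketched with exact rational arithmetic, so no floating-point tolerance is needed when comparing outcome probabilities. This sketch assumes the conditional probability takes the standard form $n!\, n_A!\, n_a!\, 2^{n_{Aa}} / (n_{AA}!\, n_{Aa}!\, n_{aa}!\, (2n)!)$ and that the p-value sums all outcomes no more probable than the observed one. The function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def hwe_conditional_prob(n_AA, n_Aa, n_aa):
    """Probability of the genotype counts given the allele counts, under HWE."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    n_a = 2 * n_aa + n_Aa
    num = factorial(n) * factorial(n_A) * factorial(n_a) * 2 ** n_Aa
    den = (factorial(n_AA) * factorial(n_Aa) * factorial(n_aa)
           * factorial(2 * n))
    return Fraction(num, den)


def hwe_exact_p_value(n_AA, n_Aa, n_aa):
    """Sum the probabilities of all outcomes (same n and n_A) that are
    no more probable than the observed one."""
    n = n_AA + n_Aa + n_aa
    n_A = 2 * n_AA + n_Aa
    n_a = 2 * n - n_A
    observed = hwe_conditional_prob(n_AA, n_Aa, n_aa)
    total = Fraction(0)
    # The heterozygote count has the same parity as n_A.
    for het in range(n_A % 2, min(n_A, n_a) + 1, 2):
        hom_A = (n_A - het) // 2
        hom_a = n - het - hom_A
        prob = hwe_conditional_prob(hom_A, het, hom_a)
        if prob <= observed:
            total += prob
    return float(total)


if __name__ == "__main__":
    print(hwe_exact_p_value(0, 19, 21))
```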
HWE test - Example
n_AA   n_Aa   n_aa   D         Probability (exact)   Chi-square
9      1      30     0.1686    0.0000*               34.67*
8      3      29     0.1436    0.0000*               25.15*
7      5      28     0.1186    0.0001*               17.16*
6      7      27     0.0936    0.0024*               10.68*
5      9      26     0.0686    0.0229*               5.74*
0      19     21     -0.056    0.0823                3.88*
4      11     25     0.043     0.1793                2.32
1      17     22     -0.031    0.4101                1.20
3      13     24     0.018     0.6585                0.42
2      15     23     -0.006    1.0000                0.05
* Causes rejection of HWE at the 5% significance level
Power and sample size of tests for HWE Statistical tests of hypothesis are subject to two kinds of errors: a true hypothesis may be rejected (type I error, whose probability α is the significance level) or a false hypothesis may not be rejected (type II error, whose probability β equals 1 minus the power of the test). For the chi-square test, theory provides that, in large samples, $X^2$ is distributed approximately as a chi-square with 1 d.f. when the hypothesis is true and as a noncentral chi-square when the hypothesis is false, i.e. $X^2 \sim \chi^2(1)$ when $H_0$ is true and $X^2 \sim \chi^2(1, \lambda)$ when $H_0$ is false, where λ is the noncentrality parameter (see tables).
Power and sample size of tests for HWE The disequilibrium coefficient, D, required for attaining 90% power and a 0.05 significance level for the chi-square test is
$$D = \pm \sqrt{\frac{10.5}{n}}\; p_A (1-p_A).$$
Alternatively, the number of samples required in order to attain 90% power and a 0.05 significance level for the chi-square test when the disequilibrium coefficient is D, is
$$n = \frac{10.5\; p_A^2 (1-p_A)^2}{D^2}.$$
* If the required power is 50% or 80%, then 10.5 is replaced by 3.84 or 7.85
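A short sketch of the power calculation, assuming the noncentrality parameter of the chi-square test is $\lambda = nD^2 / (p_A^2 (1-p_A)^2)$ and that $\lambda = 10.5$ corresponds to 90% power at the 0.05 level. Both function names are hypothetical.

```python
from math import sqrt


def detectable_disequilibrium(p_A, n, noncentrality=10.5):
    """Magnitude of D detectable with sample size n.

    The default noncentrality of 10.5 corresponds to 90% power at the
    0.05 significance level (1 d.f.).
    """
    return sqrt(noncentrality / n) * p_A * (1.0 - p_A)


def required_sample_size(p_A, D, noncentrality=10.5):
    """Number of individuals needed to detect a disequilibrium of size D."""
    if D == 0:
        raise ValueError("D must be non-zero")
    return noncentrality * (p_A * (1.0 - p_A)) ** 2 / D ** 2


if __name__ == "__main__":
    # Illustrative values only: p_A = 0.5 and D = 0.05.
    print(required_sample_size(0.5, 0.05))
```

The result is generally not an integer, so in practice it would be rounded up.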
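Linkage disequilibrium Gametic disequilibrium at two loci Gametic disequilibrium measures the association of two alleles at two different loci. Given two biallelic loci with alleles A, a and B, b respectively, let the disequilibrium coefficient be $D_{AB} = p_{AB} - p_A p_B$. The (ML) estimator of $D_{AB}$ is $\hat D_{AB} = \hat p_{AB} - \hat p_A \hat p_B$, where the hats denote observed sample frequencies. A chi-square statistic for the hypothesis of no disequilibrium, $H_0: D_{AB} = 0$, is
$$X_{AB}^2 = \frac{n\, \hat D_{AB}^2}{\hat p_A (1-\hat p_A)\, \hat p_B (1-\hat p_B)},$$
where n is the number of gametes sampled, and the test rejects $H_0$ if $X_{AB}^2 > 3.84$.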
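A minimal sketch of the gametic disequilibrium estimate and its chi-square test, assuming haplotype (gamete) counts are directly observed and that the statistic has the form $n \hat D_{AB}^2 / (\hat p_A (1-\hat p_A) \hat p_B (1-\hat p_B))$ with n the number of gametes. The function name `gametic_ld` and the counts in the usage line are hypothetical.

```python
def gametic_ld(n_AB, n_Ab, n_aB, n_ab):
    """Return (D_AB, chi-square statistic) from the four gamete counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n   # frequency of allele A at the first locus
    p_B = (n_AB + n_aB) / n   # frequency of allele B at the second locus
    D_AB = p_AB - p_A * p_B
    denom = p_A * (1.0 - p_A) * p_B * (1.0 - p_B)
    if denom == 0.0:
        raise ValueError("both loci must be polymorphic in the sample")
    chi2 = n * D_AB ** 2 / denom
    return D_AB, chi2


if __name__ == "__main__":
    D_AB, chi2 = gametic_ld(30, 10, 10, 30)  # made-up counts
    print(D_AB, chi2, "reject H0" if chi2 > 3.84 else "keep H0")
```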
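Linkage disequilibrium Gametic disequilibrium at two loci An exact test for gametic linkage disequilibrium depends on the probabilities of all possible samples of gametic numbers for the observed allele numbers. Under the assumption of no linkage disequilibrium, the probability of the gametic numbers is
$$\Pr(n_{AB}, n_{Ab}, n_{aB}, n_{ab}) = \frac{n!}{n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!}\, (p_A p_B)^{n_{AB}} (p_A p_b)^{n_{Ab}} (p_a p_B)^{n_{aB}} (p_a p_b)^{n_{ab}},$$
and the allele probabilities are
$$\Pr(n_A, n_a) = \frac{n!}{n_A!\, n_a!}\, p_A^{n_A} p_a^{n_a}, \qquad \Pr(n_B, n_b) = \frac{n!}{n_B!\, n_b!}\, p_B^{n_B} p_b^{n_b}.$$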
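Linkage disequilibrium Gametic disequilibrium at two loci Taking the ratio between these quantities gives the probability of the gametic numbers conditional on the allele numbers:
$$\Pr(n_{AB}, n_{Ab}, n_{aB}, n_{ab} \mid n_A, n_a, n_B, n_b) = \frac{n_A!\, n_a!\, n_B!\, n_b!}{n!\, n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!},$$
which depends on $n$, $n_{AB}$, $n_A$ and $n_B$ only. As in the case of HWE, the resulting probability is compared with the chosen significance level α.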
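The exact test for gametic disequilibrium can be sketched as follows, assuming the conditional probability is $n_A!\, n_a!\, n_B!\, n_b! / (n!\, n_{AB}!\, n_{Ab}!\, n_{aB}!\, n_{ab}!)$ and that the p-value sums all gametic tables with the same allele counts that are no more probable than the observed one. Function names are hypothetical.

```python
from fractions import Fraction
from math import factorial


def gametic_conditional_prob(n_AB, n_Ab, n_aB, n_ab):
    """Probability of the gamete counts given the allele counts, under no LD."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_B = n_AB + n_Ab, n_AB + n_aB
    n_a, n_b = n - n_A, n - n_B
    num = factorial(n_A) * factorial(n_a) * factorial(n_B) * factorial(n_b)
    den = (factorial(n) * factorial(n_AB) * factorial(n_Ab)
           * factorial(n_aB) * factorial(n_ab))
    return Fraction(num, den)


def ld_exact_p_value(n_AB, n_Ab, n_aB, n_ab):
    """Sum the probabilities of all tables no more probable than the observed."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_B = n_AB + n_Ab, n_AB + n_aB
    observed = gametic_conditional_prob(n_AB, n_Ab, n_aB, n_ab)
    total = Fraction(0)
    # Given n, n_A and n_B, the table is fixed by the AB count k.
    for k in range(max(0, n_A + n_B - n), min(n_A, n_B) + 1):
        prob = gametic_conditional_prob(k, n_A - k, n_B - k, n - n_A - n_B + k)
        if prob <= observed:
            total += prob
    return float(total)


if __name__ == "__main__":
    print(ld_exact_p_value(30, 10, 10, 30))  # made-up counts
```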
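Linkage disequilibrium Genotypic disequilibrium When genotypes are scored, it is often not possible to distinguish between the two double heterozygotes AB/ab and Ab/aB, so that the gametic frequencies cannot be inferred directly. Under the assumption of random mating, in which genotypic frequencies are assumed to be the products of gametic frequencies, it is possible to estimate gametic frequencies. A measure of (digenic) linkage disequilibrium between alleles A and B is the composite coefficient
$$\Delta_{AB} = P_{AB} + P_{A/B} - 2 p_A p_B,$$
where $P_{AB}$ is the frequency with which A and B occur on the same gamete and $P_{A/B}$ is the frequency with which they occur on the two different gametes of an individual.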
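Linkage disequilibrium Genotypic disequilibrium If the 9 genotypic classes are numbered as
        BB       Bb       bb
AA      $n_1$    $n_2$    $n_3$
Aa      $n_4$    $n_5$    $n_6$
aa      $n_7$    $n_8$    $n_9$
then an (ML) estimator for $\Delta_{AB}$ is
$$\hat\Delta_{AB} = \frac{1}{n}\left(2n_1 + n_2 + n_4 + \tfrac{1}{2} n_5\right) - 2\, \hat p_A \hat p_B, \qquad n = \sum_{i=1}^{9} n_i.$$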
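Linkage disequilibrium Genotypic disequilibrium The chi-square test statistic for LD is
$$X_{AB}^2 = \frac{n\, \hat\Delta_{AB}^2}{\left[\hat p_A (1-\hat p_A) + \hat D_A\right]\left[\hat p_B (1-\hat p_B) + \hat D_B\right]},$$
where $\hat D_A = \hat P_{AA} - \hat p_A^2$ and $\hat D_B = \hat P_{BB} - \hat p_B^2$ are the Hardy-Weinberg disequilibrium coefficients at the two loci. Note the explicit way in which departures from HW are included in this expression.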
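A sketch of the composite (genotypic) disequilibrium calculation from a 3 x 3 table of genotype counts. It assumes the estimator $\hat\Delta_{AB} = (2n_1 + n_2 + n_4 + n_5/2)/n - 2\hat p_A \hat p_B$ and a chi-square statistic whose denominator is $[\hat p_A(1-\hat p_A) + \hat D_A][\hat p_B(1-\hat p_B) + \hat D_B]$. The function name `composite_ld` and the example table are hypothetical.

```python
def composite_ld(table):
    """Return (delta_AB, chi-square statistic) from genotype counts.

    table rows are AA, Aa, aa; columns are BB, Bb, bb.
    """
    (n1, n2, n3), (n4, n5, n6), (n7, n8, n9) = table
    n = n1 + n2 + n3 + n4 + n5 + n6 + n7 + n8 + n9
    n_AA, n_Aa = n1 + n2 + n3, n4 + n5 + n6   # row totals
    n_BB, n_Bb = n1 + n4 + n7, n2 + n5 + n8   # column totals
    p_A = (2 * n_AA + n_Aa) / (2 * n)
    p_B = (2 * n_BB + n_Bb) / (2 * n)
    D_A = n_AA / n - p_A ** 2                 # HW disequilibrium at locus A
    D_B = n_BB / n - p_B ** 2                 # HW disequilibrium at locus B
    n_AB = 2 * n1 + n2 + n4 + n5 / 2          # A and B together in an individual
    delta = n_AB / n - 2 * p_A * p_B
    denom = (p_A * (1 - p_A) + D_A) * (p_B * (1 - p_B) + D_B)
    if denom <= 0:
        raise ValueError("chi-square statistic is undefined for this table")
    return delta, n * delta ** 2 / denom


if __name__ == "__main__":
    print(composite_ld([[1, 2, 1], [2, 4, 2], [1, 2, 1]]))
```

The double heterozygote class $n_5$ enters with weight 1/2 because its phase (AB/ab or Ab/aB) is not observed.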
$\Delta^2$ is the squared statistical correlation between two sites, and takes the value 1 if only two haplotypes are present. It is arguably the most relevant measure for association between susceptibility loci and SNPs. For example, suppose SNP1 is involved in disease susceptibility, but we genotype cases and controls at a nearby site SNP2. Then, to achieve the same power to detect associations at SNP2 as we would have at SNP1, we need to increase our sample size by a factor of $1/\Delta^2$.
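A small sketch of the sample-size argument, assuming the squared correlation is computed as $D_{AB}^2 / (p_A(1-p_A)\,p_B(1-p_B))$ with $D_{AB} = p_{AB} - p_A p_B$. The function names and the frequencies in the usage lines are hypothetical.

```python
from math import ceil


def delta_squared(p_AB, p_A, p_B):
    """Squared correlation between alleles A and B at two sites."""
    D_AB = p_AB - p_A * p_B
    return D_AB ** 2 / (p_A * (1 - p_A) * p_B * (1 - p_B))


def inflated_sample_size(n_at_causal_site, d2):
    """Sample size needed at the nearby marker to match the power
    obtained with n_at_causal_site samples at the causal site."""
    if not 0 < d2 <= 1:
        raise ValueError("squared correlation must lie in (0, 1]")
    return ceil(n_at_causal_site / d2)


if __name__ == "__main__":
    d2 = delta_squared(0.4, 0.5, 0.5)      # made-up haplotype frequencies
    print(d2, inflated_sample_size(1000, d2))
```

With only the two haplotypes AB and ab present at equal frequency, `delta_squared(0.5, 0.5, 0.5)` evaluates to 1 and no extra samples are needed.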
These measures are defined for pairs of sites, but for some applications we might instead want to measure how strong LD is across an entire region that contains many polymorphic sites. Examples include testing whether the strength of LD differs significantly among loci or across populations, or whether there is more or less LD in a region than predicted under a particular model. Measuring LD across a region is not straightforward, but one approach is to use the measure ρ, which quantifies how much recombination would be required under a particular population model to generate the LD that is seen in the data. The development of methods for estimating ρ is now an active area of research. This type of method can potentially also provide a statistically rigorous approach to the problem of determining whether LD data provide evidence for the presence of recombination hotspots.