Presentation on theme: "Design and Analysis of Genome-Wide Association Studies"— Presentation transcript:
1 Design and Analysis of Genome-Wide Association Studies
Workshop on Statistical Genetics and Genomics, February 12, 2009
Tasha E. Fingerlin
Departments of Epidemiology and Biostatistics & Informatics, Colorado School of Public Health, University of Colorado Denver
2 Today
- Goal of a genetic association study
- Rationale for genome-wide association studies
- Design and analysis considerations for GWAs
- Application to two clinically similar granulomatous lung diseases
Throughout, focus on tying vocabulary and concepts to those very familiar to epidemiology.
3 Complex Traits - Multifactorial Inheritance
The focus of genome-wide association studies has typically been on complex genetic diseases.
Examples: some cancers, schizophrenia, type 1 diabetes, cleft lip/palate, type 2 diabetes, hypertension, Alzheimer disease, rheumatoid arthritis, inflammatory bowel disease, asthma.
Complex traits do not follow a simple Mendelian mode of inheritance. They are characterized by:
- Multiple susceptibility loci
- Incomplete penetrance
- Phenocopies, heterogeneity
- Environmental risk factors
Many are of public health importance.
[Diagram: genetic variants (risk-conferring genetic factors) and non-genetic factors both contribute to the trait]
4 Genetic Association Studies
Short-term goal: identify genetic variants that explain differences in phenotype among individuals in a study population.
- Qualitative: disease status, presence/absence of congenital defect
- Quantitative: blood glucose levels, % body fat
If an association is found, further study can follow to:
- Understand mechanism of action and disease etiology in individuals
- Characterize relevance and/or impact in a more general population
Long-term goal: to inform the process of identifying and delivering better prevention and treatment strategies.
Long-term goals may be realized in several ways, ranging from simply gaining insight into pathways relevant for etiology to being able to tailor preventions/treatments for certain genetic groups.
5 DNA Variation
>99.9% of the sequence is identical between any two chromosomes.
- Compare maternal and paternal chromosome 1 in a single person
- Compare Y chromosomes between two unrelated males
Even though most of the sequence is identical between two chromosomes, the genome sequence is so long (~3 billion base pairs) that there are still many variations.
Some DNA variations are responsible for biological changes; others have no known function.
Alleles are the alternative forms of a DNA segment at a given genetic location.
Genetic polymorphism: DNA segment with ≥2 common alleles.
Here, “common” is defined as having frequency ≥1% in the population, which depends on the population being studied.
6 Single Nucleotide Polymorphisms: SNPs
SNPs are DNA sequence variations that occur when a single nucleotide is altered.
- Alleles at this SNP are “G” and “T”
- SNPs are the most common form of variation in the human genome
- SNPs are catalogued in several databases
[Figure: DNA sequence showing a SNP with alleles G and T]
7 Genotypes and Haplotypes
Genotype: pair of alleles (one paternal, one maternal) at a locus.
- Genotype for this individual is GT
Haplotype: sequence of alleles along a single chromosome.
- Genotypes for this individual (vertical): CA and TT
- Haplotypes (horizontal): CT and AT
Note that we usually don’t spell out the alleles at non-polymorphic sites.
[Figure: maternal and paternal chromosomes with the alleles at each SNP]
8 Scope of a Genetic Association Study
Candidate gene
- Known functional variants
- Variants with unknown function in exons, introns, regulatory regions
Linkage candidate region
- Functional variants, or those with unknown function in candidate genes
- More general coverage of region using many markers
Genome-wide
- Test for association with hundreds of thousands (millions) of SNPs spread across the entire genome
- Many design strategies possible for distributing markers
The number and placement of genetic markers depend on the extent of LD in all 3 cases.
* Sabeti PC et al. (2002). Nature 419:
9 Genome-Wide Association Studies
Rationale:
- Linkage analysis using families takes an unbiased look at the whole genome, but is underpowered for the size of genetic effects we expect to see for many complex genetic traits.
- Candidate gene association studies have greater power to identify smaller genetic effects, but rely on a priori knowledge about disease etiology.
- Genome-wide association studies combine the genomic coverage of linkage analysis with the power of association to have a much better chance of finding complex trait susceptibility variants.
Linkage analysis in families, where we track diseases in families, …
10 Why are They Possible Now?
Genome-wide association studies are possible because of parallel advances in two areas.
Genotyping technology:
- We now have the ability to type hundreds of thousands (or millions) of SNPs in one reaction on a “SNP chip.” The cost can be as low as $200-$300 per person.
- Two primary platforms: Affymetrix and Illumina.
Design and analysis:
- Availability of SNP databases, HapMap, and other resources to identify the SNPs and design SNP chips.
- Faster computers to carry out the millions of calculations make implementation possible.
11 Design and Analysis Strategies: Moving Target
A genetic factor is like any other potential risk factor, and the same study design and analysis principles hold, in addition to those specific to GWAs.
Standard case-control (matched or unmatched), cohort-based quantitative trait, and longitudinal designs are common.
In what follows, I will talk about current ideas and methods, with a focus on assumptions and quality control.
Focus today is on case-control design, but many of the principles apply to other designs.
12 SNP Chips: Number and Placement of SNPs
- A “typical” SNP chip has at least 317,000 SNPs distributed across the genome. Newest: ~1 million.
- The newest chips can also measure (directly or indirectly) some types of copy number variation.
- We do not directly measure genotypes at all genetic polymorphisms, but rely on association between the polymorphisms we do assay and those which we do not assay.
- SNP-SNP association, or linkage disequilibrium, is fundamental to our ability to sample the whole genome with relatively few SNPs.
13 Linkage Disequilibrium (LD)
Linkage disequilibrium: the non-random association of alleles at linked loci.
A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes.
If these were the only two haplotypes in the population, then alleles G and A (C and T) are in perfect linkage disequilibrium.
- If we genotype the first SNP, we know what the alleles are at the second SNP.
- The first SNP perfectly “tags” the second SNP.
LD is influenced by recombination, gene conversion, mutation, selection, etc.
If the causal variant is at A and we tested B, B would be associated with disease.
[Figure: the two haplotypes, G-A and C-T, at a pair of SNPs]
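As a rough illustration of how LD between two SNPs is commonly quantified, the sketch below computes D, D' and r² from haplotype and allele frequencies. The function name `ld_measures` and the example frequencies (0.6 for the G-A haplotype, 0.4 for C-T) are assumptions made for illustration; they are not values from the talk.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D_prime, r_squared) for two biallelic SNPs.

    p_ab : frequency of the haplotype carrying allele a at SNP 1 and b at SNP 2
    p_a  : frequency of allele a at SNP 1
    p_b  : frequency of allele b at SNP 2
    """
    d = p_ab - p_a * p_b  # deviation from independence
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


# Hypothetical population with only two haplotypes: G-A (60%) and C-T (40%).
d, d_prime, r_squared = ld_measures(p_ab=0.6, p_a=0.6, p_b=0.6)
```

With only the G-A and C-T haplotypes present, D' and r² both come out as 1 (up to rounding), which is the situation in which the first SNP perfectly tags the second.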
14 In general, LD between two SNPs decreases with physical distance
- Extent of LD varies greatly depending on region of genome
- If LD is strong, need fewer SNPs to capture variation in a region
16 HapMap
Multi-country effort to identify and catalog common human genetic variants.
Developed to better understand and catalogue LD patterns across the genome in several populations.
- Genotyped ~4 million SNPs on samples of African, east Asian, European ancestry.
- All genotype data in a publicly available database.
Can download the genotype data:
- Able to examine LD patterns across genome
- Can estimate approximate coverage of a given SNP chip
Can represent 80-90% of common SNPs with:
- ~300,000 tag SNPs for European or Asian samples
- ~500,000 tag SNPs for African samples
17 Case and Control Selection
Case and control samples may be population-based.
Cases and controls may be chosen to increase magnitude of contrast.
Case sample may be selected to be enriched for predisposing variant(s):
- Family history
- Early age of onset
- Increased severity of disease
Control sample may be selected to be “very healthy” or “super controls”:
- E.g. for type 2 diabetes, may select individuals who have normal response to glucose at age 70
Control selection is just as important (and tricky) as for any case-control study.
18 Testing for Genetic Association with Disease
Question of interest: Are the alleles or genotypes at a genetic marker associated with disease status?
Use the usual statistical machinery to get estimates of measures of association and to test for association for each of the SNPs.
One typical approach: test for association between disease status and having 0, 1 or 2 copies of the rare allele at a SNP using the Cochran-Armitage test for trend.
Note: may use logistic regression to adjust for other known risk factors, etc.
Note that using traditional thresholds is not appropriate. Will deal with that in a moment.
Need to verify they used additive test; just need a figure like this to show what you end up with. This one from Rheumatoid Arthritis.
Pearson, T. A. et al. JAMA 2008;299:
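A minimal sketch of the Cochran-Armitage trend test for a single SNP, assuming genotype counts are tabulated by number of copies of the rare allele (0, 1, 2) and that the usual additive weights 0, 1, 2 are wanted. The function name and the example counts are hypothetical. The p-value uses the fact that the statistic is referred to a chi-square distribution with 1 degree of freedom.

```python
import math


def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Return (chi2, p_value) for the trend test on a 2 x 3 genotype table.

    cases, controls : counts of individuals with 0, 1, 2 copies of the rare allele
    weights         : scores per genotype (0, 1, 2 gives the additive test)
    """
    col_totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls

    # Trend statistic: weighted difference between case and control counts.
    t = sum(w * (r * n_controls - s * n_cases)
            for w, r, s in zip(weights, cases, controls))

    # Variance of the statistic under the null hypothesis of no association.
    k = len(weights)
    bracket = sum(w * w * c * (n - c) for w, c in zip(weights, col_totals))
    bracket -= 2 * sum(weights[i] * weights[j] * col_totals[i] * col_totals[j]
                       for i in range(k) for j in range(i + 1, k))
    var = n_cases * n_controls / n * bracket
    if var <= 0:
        return 0.0, 1.0

    chi2 = t * t / var
    p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 df
    return chi2, p_value


# Hypothetical counts for one SNP (0, 1, 2 copies of the rare allele).
chi2, p_value = cochran_armitage_trend(cases=[100, 80, 20], controls=[120, 70, 10])
```

In a GWA setting this calculation would be repeated for every SNP, so the resulting p-values cannot be judged against a conventional threshold.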
19 Interpreting the Statistical Results
Testing for association at each of hundreds of thousands of markers dictates that traditional statistical significance thresholds (e.g. α = .05) are not appropriate.
That aside (more in a few minutes), if you identify a SNP that is significantly associated with disease, there are three possibilities:
- There is a causal relationship between SNP and disease
- The marker is in linkage disequilibrium with a causal locus
- False positive
There are many potential sources of systematic error that might lead to false positive results.
Genotyping quality control issues are particularly important.
20 Confounding by Ancestry (a.k.a. Population Stratification)
Control selection critical as always.
Confounding by ancestry: distortion of the relationship between the genetic risk factor and the outcome of interest due to ancestry that is related to both the frequency of the putative genetic risk factor and whether or not the subject is a case or a control.
[Diagram: ancestry is related to both the genetic risk factor and case/control status]
Note that some now refer to population stratification as the presence of substructure, whether or not there exists a difference in disease risk between the subgroups. Traditionally the term referred to the presence of confounding.
Also note that the emphasis has been on the potential for confounding that leads to an increase in type I error, but negative confounding is also possible, as always.
21 Population Stratification
- Distribution of genotypes differs between cases and controls
- Might conclude that allele A (or genotype AA) is related to disease
[Figure: genotype distributions (TT, AT, AA) in cases and controls]
22 Population Stratification
If cases and controls are not well-matched ancestrally:
- Unequal distribution of non-disease-related alleles between cases and controls
- Any allele more common in a population with increased risk of disease may appear to be associated with disease
[Figure: genotype distributions (TT, AT, AA) in cases and controls, broken down by Pop 1 and Pop 2]
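The effect described above can be sketched numerically. In the example below, all counts are hypothetical: allele A is carried by 80% of Pop 1 and 20% of Pop 2, it has no effect on disease within either population, and Pop 1 contributes most of the cases. Pooling the two populations produces an apparent association, while a stratified (Mantel-Haenszel) estimate does not.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2 x 2 table.

    a, b : cases with / without allele A
    c, d : controls with / without allele A
    """
    return (a * d) / (b * c)


def mantel_haenszel_or(tables):
    """Mantel-Haenszel odds ratio pooled across strata (one 2 x 2 table each)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den


# Hypothetical counts (a, b, c, d) per population.
pop1 = (320, 80, 80, 20)    # 400 cases, 100 controls, 80% carry A
pop2 = (20, 80, 80, 320)    # 100 cases, 400 controls, 20% carry A
pooled = tuple(x + y for x, y in zip(pop1, pop2))

or_pop1 = odds_ratio(*pop1)                      # no association within Pop 1
or_pop2 = odds_ratio(*pop2)                      # no association within Pop 2
or_pooled = odds_ratio(*pooled)                  # spurious association when pooled
or_stratified = mantel_haenszel_or([pop1, pop2])
```

Here the within-population odds ratios are both 1, the pooled odds ratio is about 4.5, and the stratified estimate returns to 1, because the only thing driving the pooled result is ancestry.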
23 Population Stratification
Unequal distribution of alleles may result from:
- Sample made up of more than one distinct population
- Sample made up of individuals with differing levels of admixture
Parra et al. AJHG 63:1839, 1998
24 Using the GWA Data to Avoid Population Stratification
Several options exist to allow controlling for ancestry using markers across the genome.
All are based on the idea that stratification should exist across the genome, and that we can use the information on the genome-wide markers to:
- estimate ancestry groups, remove extreme outliers, control for other variation
- estimate inflation of test statistic and adjust all test statistics
In each case, assumes constant effect of ancestry, which may or may not be appropriate.
Bottom line is that with a genome of data, can do a very good job of understanding the potential for and minimizing the impact of population substructure.
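One widely used way to "estimate inflation of test statistic and adjust all test statistics" is often called genomic control. The sketch below assumes 1-degree-of-freedom chi-square statistics (for example, from the trend test) and estimates the inflation factor λ as the observed median divided by the null median. The function name is hypothetical, and never letting λ fall below 1 is a common convention rather than something stated in the talk.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for a list of 1-df chi-square statistics."""
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # convention: do not inflate statistics when lambda < 1
    adjusted = [x / lam for x in chi2_stats]
    return lam, adjusted
```

Dividing every statistic by a single λ is where the "constant effect of ancestry" assumption enters: the same correction is applied to every SNP regardless of how strongly its allele frequency varies with ancestry.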
25 Potential Solutions to Multiple Testing Issue
Bonferroni correction
- Assume all tests performed are independent
- Estimate number of independent polymorphisms in genome
- Threshold often considered appropriate: 5 × 10^-8
Other, less conservative allocation of experiment-wide α over the genome
- Perhaps spend more on linkage regions or for SNPs in coding regions of genes
Permutation
- Implementation for case-control study: permute case and control status, perform all tests, record the most significant p-value among those tests, and then re-permute case-control status and test again. Repeat many times.
- P-value for the most significant test is the proportion of permutations that had a “best” p-value as small as or smaller than the one you observe with the observed data (the data with the right case and control labels).
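Both approaches can be sketched in a few lines. The Bonferroni helper simply divides α by the number of tests; using 1,000,000 independent tests at α = 0.05 is an assumption chosen because it reproduces the 5 × 10^-8 threshold. The permutation routine follows the steps above, except that it tracks the largest trend statistic rather than the smallest p-value, which is equivalent for 1-df tests. All function names and the tiny data layout (one list of 0/1/2 genotypes per SNP, 1 = case, 0 = control) are hypothetical.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def trend_chi2(genotypes, status):
    """1-df trend statistic, computed as N times the squared correlation
    between genotype score (0, 1, 2) and case status (1 = case, 0 = control)."""
    n = len(status)
    sx, sy = sum(genotypes), sum(status)
    sxy = sum(g * s for g, s in zip(genotypes, status))
    sxx = sum(g * g for g in genotypes)
    num = n * sxy - sx * sy
    den = (n * sxx - sx * sx) * (n * sy - sy * sy)
    return n * num * num / den if den > 0 else 0.0


def best_test_permutation_p(genotype_matrix, status, n_perm=1000, seed=0):
    """Permutation p-value for the most significant SNP.

    genotype_matrix : list of SNPs, each a list of genotypes (one per subject)
    status          : list of case (1) / control (0) labels, one per subject
    """
    rng = random.Random(seed)
    observed = max(trend_chi2(snp, status) for snp in genotype_matrix)
    labels = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute case and control status
        best = max(trend_chi2(snp, labels) for snp in genotype_matrix)
        if best >= observed:
            hits += 1
    return hits / n_perm


threshold = bonferroni_threshold(0.05, 1_000_000)
```

Because each permutation keeps the genotypes intact and only shuffles the labels, the correlation between SNPs is preserved, so this approach does not need the independence assumption that Bonferroni makes.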
26 Q-Q Plot
If points deviate (significantly?) from the line of equality, this indicates that the two distributions are different.
Some will take the point at which the observed p-values differ from the expected as the point to declare statistical significance.
Important points:
- Can have deviation from line that is indicative of violated assumptions (e.g. existence of population stratification)
- In tails of distribution, have less information, and so might require large divergence from expected
Figure from: G. Abecasis
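A minimal sketch of how the points on such a plot can be computed, assuming the usual GWA convention of plotting -log10 of observed p-values against -log10 of their expected values under the null (uniform) distribution. The expected value i / (n + 1) for the i-th smallest p-value is one common choice; the function name is hypothetical, and p-values are assumed to be strictly greater than 0.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values, smallest p first."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = rank / (n + 1)  # expected i-th order statistic under uniformity
        points.append((-math.log10(expected), -math.log10(p)))
    return points
```

Points for null data fall close to the line of equality; a departure across most of the range, rather than only at the extreme end, is the pattern that suggests a violated assumption such as population stratification.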
27 Using Multiple Samples
Rationale: Given the very large number of tests performed, use multiple samples as a way to reduce the expected number of false positive results at the end of the study.
Split-sample
- Approach: Rather than testing the entire sample on the entire genome, test for association with some proportion of your samples and then test some proportion of those markers in the rest of your samples*.
Independent samples
- Approach: Rather than split your own sample, use another independent sample.
In either case, can dramatically reduce the number of false positive results while maintaining power.
Note that the analysis strategy may not be to actually replicate, but to do a joint analysis with an appropriate “hit” for having observed it in the first set.
Also note the potential use of meta-analyses.
28 Granulomatous Lung Diseases
Chronic Beryllium Disease (CBD)
- Exposure to beryllium results in formation of granulomas in lung among some individuals
Sarcoidosis
- Unknown exposure(s) result in granuloma formation and inflammation in lung, but other organs often involved
The deep lung: a biopsy that includes air sacs and blood vessels, and yes, those big areas of lots of cells in a circle are granulomas. They contain macrophages/monocytes in the center, with multinucleated giant cells and T cells in a rim around the outside.
29 Hypothesis
Sarcoidosis and CBD share genetic factors important in their similar granulomatous inflammatory pathways.
[Diagram: CBD and sarcoidosis, with disease severity and disease risk]
31 Top Region for CBD
p = 10^-11
Yellow line is the p-value of the top 3000 CBD’s and top 3000 SARC’s
32 Region Shared by both CBD and Sarcoidosis on 8p23.2
p = 10^-2 to 10^-4
33 Other Important Topics of Present and Future
- Imputation
- Careful consideration of non-genetic factors
- Investigation of interactions: gene-environment and gene-gene
- Sequencing data
34 Acknowledgements
Wake Forest University
- Carl D. Langefeld, PhD
University of Michigan
- Michael Boehnke, PhD
- Goncalo R. Abecasis, PhD
National Jewish Health
- Lisa Maier, MPH, MD
- Lori Silveira, MS