1 What is an association study? Define linkage disequilibrium
Miranda Durkie, January 2010
2 What is an association study?
Association is a statistical measure of the co-occurrence of certain phenotypic traits with certain alleles.
An association study is an examination of genetic variation across a given genome, designed to identify genetic associations with observable traits.
3 How does association occur?
Direct causation: having allele A makes you susceptible to disease D. Possession of A may not be sufficient in itself to give you D, but it makes it more likely you'll develop D.
Natural selection: people who have disease D may be more likely to survive and reproduce if they have allele A.
Population stratification: the population contains several distinct genetic subsets, and disease D and allele A both happen to be more common in one particular subset.
Type 1 error: association studies test a large number of markers to find significant associations (p < 0.05). However, by chance 5% of results will be significant at p = 0.05 and 1% at p = 0.01. Therefore the data need correction; in the past this was not done adequately, so results could not be replicated.
Linkage disequilibrium: the aim of association studies is to discover associations caused by linkage disequilibrium between allele A and disease D.
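
The Type 1 error point can be made concrete with a little arithmetic. The sketch below is illustrative only: the marker count of 1,000 is a hypothetical figure, not one from the slides, and the second function assumes the tests are independent, which markers in LD are not.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of null markers that pass the threshold by chance."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float) -> float:
    """Chance of one or more false positives, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


if __name__ == "__main__":
    n_markers = 1000  # hypothetical number of markers tested
    for alpha in (0.05, 0.01):
        print(alpha,
              expected_false_positives(n_markers, alpha),
              prob_at_least_one_false_positive(n_markers, alpha))
```

With many markers a false positive is close to certain at either threshold, which is why uncorrected results often failed to replicate.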
4 Linkage
Linkage analysis is used to track the inheritance of alleles within a family.
Linked markers or alleles are only separated if a recombination event occurs.
The closer a marker is to the disease/susceptibility allele, the less likely it is to be separated from it by recombination over several generations. This leads to a common haplotype which occurs more often than would be expected by chance.
Within an individual family this linkage can extend up to 20 cM, but in association studies it extends only a few kb.
Linkage disequilibrium is the non-random association between two or more alleles located together on the same chromosome.
5 Linkage disequilibrium
Consider 2 markers with alleles A, a and B, b.
Frequency of allele A = p and a = 1 - p.
Frequency of allele B = q and b = 1 - q.
If there is no association then haplotype AB occurs at frequency pq.
However, if the frequency of AB > pq then A and B are in positive LD.
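
The comparison of the AB frequency with pq can be sketched in a few lines. The slide only defines the comparison itself. The normalised measures D' and r² computed below are standard LD statistics added here for illustration, and the example frequencies are hypothetical.

```python
def ld_measures(freq_ab: float, p: float, q: float):
    """Return (D, D', r^2) for haplotype AB.

    freq_ab: observed frequency of haplotype AB
    p: frequency of allele A, q: frequency of allele B (slide notation)
    """
    d = freq_ab - p * q  # D > 0 means AB is more common than expected
    if d >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    d_prime = d / d_max if d_max > 0 else 0.0
    denom = p * (1 - p) * q * (1 - q)
    r_squared = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical frequencies: p = 0.6, q = 0.7, observed AB = 0.5 (pq = 0.42)
    print(ld_measures(0.5, 0.6, 0.7))
```

A positive D corresponds to the "AB > pq" case on the slide. D' and r² rescale D so that markers with different allele frequencies can be compared.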
6 Association vs linkage studies
Linkage is the relationship between alleles, whilst association is the relationship between alleles and phenotypes.
Association studies do not study families but instead look for differences in allele frequencies between different groups of individuals with defined phenotypes.
For both types of study, the disease-causing mutation and/or susceptibility allele does not need to be known. Instead, SNPs or other markers such as di-, tri- or tetra-nucleotide repeats which are in linkage disequilibrium with the disease/susceptibility allele are used.
7 Designing an association study
1. Identify SNPs to analyse
2. Genotype all SNPs in subset of the samples
3. Identify tagSNPs
4. Genotype tagSNPs in all samples
5. Analyse data
8 1. Identify SNPs to analyse
Work out the region of interest, or choose regions of known homology from a mouse or other animal model.
Work out the size of the area you wish to study, e.g. choose a 1 Mb region around your locus of interest and choose one SNP every 500 bp.
If possible, include SNPs that have been validated in the same ethnic group as the one you are studying.
Prioritise SNPs with higher polymorphic frequencies (>10%).
9 Identify SNPs cont.
If looking within genes, prioritise possible functional variants, e.g. non-synonymous SNPs within exons.
Read current literature to find out if any of the SNPs have been associated with similar phenotypes in other studies.
Ensure that there are no SNPs under the primer or probe binding sites, which could lead to non-amplification of one allele and skew your results.
Due to advances in technology, the majority of current association studies now look at the whole genome = genome-wide association studies (GWAS).
10 2. Genotype subset of samples
Ensure cases and controls are ethnically matched.
Ensure methodology is robust, accurate and high-throughput, e.g. SNP arrays - which one? Exonic only? Platform? Cost? No. of SNPs?
Genotype at least 96 controls and, if you wish, 96 cases.
Record the genotypes conservatively, i.e. if unsure mark as unknown.
Analyse the data to:
Check for deviation from Hardy-Weinberg equilibrium for all alleles - if a deviation is found it is likely that genotyping errors have been made, so re-check.
Calculate LD scores for SNPs in the region.
Identify tagSNPs (also called haplotype tagging SNPs or htSNPs).
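
The Hardy-Weinberg check can be sketched as a chi-square goodness-of-fit test on genotype counts. This is one common way to do it, not necessarily the method the author used. The genotype counts in the example are hypothetical, and the p-value assumes 1 degree of freedom for a biallelic SNP.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Chi-square test for deviation from Hardy-Weinberg equilibrium.

    Arguments are observed counts of the three genotypes (AA, AB, BB).
    Returns (chi2, p_value) using 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts for 96 controls with far too few heterozygotes
    print(hwe_chi_square(40, 16, 40))
```

A very small p-value flags the SNP for re-checking. With small samples an exact test is often preferred over the chi-square approximation.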
11 3. Identify tagSNPs
There are over 10 million SNPs in the human genome.
Linked SNPs are often inherited together as a block, and the genotypes of these SNPs can be used to generate a haplotype.
The key SNPs that uniquely define the haplotype are called tagSNPs or haplotype tagging SNPs.
The HapMap project started in 2002 and was an international collaboration to describe common patterns of genetic variation between individuals.
It identified around 500,000 key tagSNPs which can be used to generate inferred haplotypes of surrounding SNPs.
This has made genome-wide scans more efficient and comprehensive.
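
One simple way to picture tagSNP selection is a greedy pass over pairwise r² values: repeatedly pick the SNP that stands in for the most SNPs not yet covered. This is a sketch of that idea only, not the procedure used by HapMap or any particular tool. The SNP names, r² values and the 0.8 threshold are all illustrative assumptions.

```python
from collections import defaultdict


def build_r2_lookup(pairs):
    """Turn {(snp_a, snp_b): r2} into a symmetric nested lookup."""
    lookup = defaultdict(dict)
    for (a, b), value in pairs.items():
        lookup[a][b] = value
        lookup[b][a] = value
    return lookup


def covered_by(snp, candidates, r2_lookup, threshold):
    """SNPs among candidates that snp can stand in for (itself included)."""
    return {other for other in candidates
            if other == snp
            or r2_lookup.get(snp, {}).get(other, 0.0) >= threshold}


def pick_tag_snps(snps, r2_lookup, threshold=0.8):
    """Greedy selection: each round keeps the SNP covering most untagged SNPs."""
    untagged = set(snps)
    tags = []
    while untagged:
        best = max(sorted(untagged),  # sorted so ties break alphabetically
                   key=lambda s: len(covered_by(s, untagged, r2_lookup, threshold)))
        tags.append(best)
        untagged -= covered_by(best, untagged, r2_lookup, threshold)
    return tags


if __name__ == "__main__":
    # Hypothetical SNPs and pairwise r^2 values
    snps = ["snp1", "snp2", "snp3", "snp4", "snp5"]
    pairs = {("snp1", "snp2"): 0.95, ("snp1", "snp3"): 0.90,
             ("snp2", "snp3"): 0.85, ("snp4", "snp5"): 0.30}
    print(pick_tag_snps(snps, build_r2_lookup(pairs)))
```

In this toy data the first three SNPs form one block and need a single tag, while the weakly correlated last two must each be genotyped.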
12 4. Genotype tagSNPs in all samples
Commercially available SNP arrays have been designed by several companies, e.g. Affymetrix and Illumina, to cover hundreds of thousands of SNPs across the whole genome.
They can have slightly different target SNPs, e.g. Illumina Human-1 focuses on exonic SNPs, thus concentrating on potential functional variants.
These arrays use tagSNPs to maximise the amount of data generated by as few SNPs as possible.
In recognition of the potential role of CNVs in complex disease susceptibility, many arrays also study CNVs.
13 How many samples?
You must ensure sufficient cases and controls are tested to reach statistical significance.
The lower the odds ratio for an increase in susceptibility, the more samples are required for the testing to reach statistical significance.
It is estimated that common susceptibility loci are likely to have odds ratios (OR) of 1.1 to 1.5.
Therefore, for example, in order to achieve 90% power to detect an allele with 0.2 frequency and an OR of 1.2, more than 6,000 affected cases and more than double that number of normal controls are required.
If the frequency of the variant is only 0.05 you would need 20,000 cases.
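
The relationship between odds ratio, allele frequency and sample size can be sketched with a normal-approximation power calculation for an allelic (two-proportion) test. The slide does not say how its figures were derived, so the choices here are assumptions: a genome-wide significance level of 5e-8, two controls per case, and each person contributing two alleles.

```python
import math
from statistics import NormalDist


def cases_needed(control_freq: float, odds_ratio: float,
                 alpha: float = 5e-8, power: float = 0.90,
                 controls_per_case: float = 2.0) -> int:
    """Approximate number of cases for an allelic test (assumed design)."""
    p0 = control_freq
    case_odds = odds_ratio * p0 / (1.0 - p0)
    p1 = case_odds / (1.0 + case_odds)  # allele frequency implied in cases
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided threshold
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p0 * (1.0 - p0) / controls_per_case
    case_alleles = (z_alpha + z_power) ** 2 * variance / (p1 - p0) ** 2
    return math.ceil(case_alleles / 2.0)  # two alleles per person


if __name__ == "__main__":
    for freq in (0.2, 0.05):
        print(freq, cases_needed(freq, odds_ratio=1.2))
```

Under these assumptions the sketch gives case numbers of the same order as those quoted above, and it shows how quickly the requirement grows as the odds ratio or the allele frequency falls.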
14 5. Analyse data
Do single-point analysis first by looking at individual SNPs and calculating χ² and odds ratios.
You need to apply a correction for multiple testing, e.g. Bonferroni correction, which is conservative when the alleles being studied are in LD with each other (non-independent tests).
Once you have tested each individual SNP for association, you can then construct haplotypes and study them for association with the disease/trait.
Use bioinformatics programs such as HelixTree, SNPHAP and Stata.
Because of the problems with sample size for detecting low susceptibility traits, meta-analysis has been increasingly used. Meta-analysis of GWA datasets can increase the power to detect association signals by increasing sample size and by examining more variants throughout the genome than each dataset alone.
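
A single-point analysis for one SNP can be sketched from a 2x2 table of allele counts in cases and controls. The counts below are hypothetical, and the test count of 500,000 is borrowed from the tagSNP figure quoted earlier purely to illustrate a Bonferroni threshold. Real analyses would normally use one of the packages named above.

```python
import math


def allelic_association(case_a: int, case_b: int, ctrl_a: int, ctrl_b: int):
    """Chi-square (1 df), odds ratio and p-value from allele counts.

    case_a / case_b: counts of allele A / allele B in cases
    ctrl_a / ctrl_b: counts of allele A / allele B in controls
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = ((case_a + case_b) * (ctrl_a + ctrl_b)
                   * (case_a + ctrl_a) * (case_b + ctrl_b))
    chi2 = numerator / denominator
    odds_ratio = (case_a * ctrl_b) / (case_b * ctrl_a)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square upper tail, 1 df
    return chi2, odds_ratio, p_value


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance level after Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    # Hypothetical allele counts: 2,000 cases and 3,000 controls
    chi2, odds_ratio, p_value = allelic_association(1000, 3000, 1200, 4800)
    threshold = bonferroni_threshold(0.05, 500_000)
    print(chi2, odds_ratio, p_value, p_value < threshold)
```

Comparing each p-value with the corrected threshold, rather than with 0.05, is what keeps chance findings from being reported as associations.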
15 Real examples 1
2007: the Wellcome Trust published a GWA study looking at 2,000 cases for each of seven common diseases and 3,000 shared controls.
It found 24 associations: 1 in bipolar disorder, 1 in coronary artery disease, 9 in Crohn's disease, 3 in rheumatoid arthritis, 7 in type 1 diabetes and 3 in type 2 diabetes.
It linked 10 genes to common disorders where the link was not previously known.
Colorectal cancer GWA has found 10 associated SNPs, 5 of which are linked to the TGFβ superfamily signalling pathway.
16 Real examples 2
GWA studies have led to the discovery of at least 24 loci linked to type 2 diabetes (T2D).
These are mainly linked to the insulin secretion pathway rather than insulin resistance.
However, it is estimated that these loci only account for 5% of the factors contributing to heritability of T2D.
Studies of hundreds of thousands or even millions of individuals are required to identify low susceptibility alleles.
CNV associations have been found for schizophrenia, Alzheimer's and Parkinson's.
17 Future of GWA
Study of gene-gene and gene-environment interactions, which may be missed by single-point GWA, will be crucial.
The majority of associated variants will not be functional, therefore work will be required to identify causal variants.
SNPs account for 78% of variation in the genome but only 26% of total nucleotide differences.
Further study of CNVs will be crucial.
Study of rare rather than common variants (1000G).
Study of regulatory variants.
Next generation sequencing.