Presentation on theme: "Population structure. Population structure in case-control studies Population consists of underlying subpopulations. Disease prevalence different between."— Presentation transcript:
Population structure in case-control studies Population consists of underlying subpopulations. Disease prevalence different between subpopulations. Cases preferentially ascertained from specific subpopulations. False positive evidence of association will occur at genetic markers that differ in genotype frequencies between the subpopulations. Traditionally, human geneticists have been sceptical of case- control studies for this reason. CASES CONTROLS
Example Population consists of two equally frequent isolated sub- populations. In the population overall: Pr(Disease & MM) < Pr(Disease) Pr(MM) If we ascertain individuals without regard to subpopulation, cases tend to be selected from subpopulation 1, which has a low frequency of the MM marker genotype. Subpopulation 1Subpopulation 2Overall Frequency0.5 1 Pr(MM) Pr(Disease) Pr(Disease & MM)0.09
Variation across populations PopulationCancer prevalence per 100,000 indivduals BreastLungProstate White hispanic White non-hispanic Black Asian or Pacific islander American indian583354
Matching One solution to the problem is to allow for structure at the design stage, by matching cases and controls for ethnic group. When a case is selected from a given ethnic group, a matched control is selected from the same group. Matched case-control studies require a matched analysis. However, there may be fine-scale structure within ethnic groups or population admixture that cannot be accounted for by matching. Apparent association between SNPs and type 2 diabetes in Pima Indians. Type 2 diabetes occurs with greater prevalence in Caucasian individuals. Association due to population admixture: cases tended to have a greater proportion of Caucasian ancestry, and allele frequencies vary between the ancestral populations.
Solutions to the problem We can eliminate the problem of population structure by collecting family data. Family-based association designs ascertain affected cases and their parents. Form “internal” controls from alleles not transmitted from the parents to the child, effectively matching for ancestry. Less powerful since two parents are required to form a single matched control. Parental data may not always be available, e.g. for late-age onset diseases. For unrelated samples of cases and controls, we can make use of genotype data across the genome to make inferences about and/or adjust for population ancestry. In the presence of structure, there will be many more (false) positive signals of association than we would expect by chance.
Genomic control Devlin and Roeder (1999) used theoretical arguments to propose that with population structure, the distribution of Cochran-Armitage trend tests, genome-wide, is inflated by a constant multiplicative factor λ. We can estimate the multiplicative inflation factor using the statistic λ = median(X i 2 )/ Inflation factor λ > 1 indicates population structure and/or genotyping error. We can carry out an adjusted test of association that takes account of any mismatching of cases/controls at any SNP using the statistic X i 2 / λ. Inflation factor λ = 1.11 Population outliers and/or structure? True hits?
Comments Advantages. Easy to implement genomic control in whole genome association studies. Requires relatively small numbers of markers (minimum of around 50 SNPs). Can be extended to the analysis of quantitative traits and adapted to more genotypic association tests. Disadvantages. Limited to relatively simple tests of association, and is less robust to haplotype tests, for example. There will be a loss in power if there are different genetic effects acting in the different subpopulations.
Multivariate techniques Principal components analysis (PCA) has become a standard tool in genetics to study geographic variation in allele frequencies. PCA is used to infer continuous axes of genetic variation (eigenvectors) that reduce the data to a small number of dimensions, whilst describing as much of the variability between individuals as possible. We can make use of PCA in GWA studies to: identify “population outliers” using genotype data available from the HapMap project; generate axes of genetic variation to account for structure within the study population.
Population outliers The international HapMap project provides high-density genotype data for three reference populations: 30 CEPH trios from Utah with Northern European ancestry (CEU); 30 Yoruba trios from Ibadan, Nigeria (YRI); 45 unrelated Japanese individuals from Tokyo (JPT) and 45 unrelated Han Chinese individuals from Beijing (CHB). HapMap samples can be used to define two axes of genetic variation that broadly distinguish populations of European, African and Asian ancestry. Perform PCA with genotype data from GWA study combined with that from reference HapMap samples at same SNPs. Exclude population outliers from association analysis.
Example: UK WTCCC1 Afro-Caribbean samples South Asian samples QC filtered samples genotyped at ~400K clean SNPs
Structure within populations The same PCA techniques can be applied to genotype data from GWA study without using HapMap samples as reference. Axes of genetic variation can be used to investigate “finer-scale” structure within the study population. Are axes of genetic variation associated with disease phenotype? May reflect fine-scale structure confounded with disease that could inflate genotype-phenotype association statistics. Axes of genetic variation can be used as covariates within logistic regression modelling framework to adjust for underlying population structure.
Example: European population structure Novembre et al. (2008). 1,387 samples ~200K SNPs
Software Standard statistical software, such as R, can be used to perform PCA on genetic data. Patterson et al. (2006) have developed the EIGENSOFT suite of software packages that use PCA to identify population structure in large scale data sets with hundreds of thousands of genetic markers and can allow for LD between loci. SMARTPCA software can be used to perform PCA analysis and can: generate any number of axes of genetic variation; remove outliers on the basis of deviation along axes of genetic variation; test for association between each axis of genetic variation and disease to determine which may be confounded. Multi-dimensional scaling (MDS), a related multivariate statistical technique, can also be used to estimate axes of genetic variation in PLINK.
Analysis workflow Perform genome-wide trend tests of association. Produce QQ plot and calculate the genomic control inflation factor. Evidence of structure? PCA of GWA and HapMap samples Plot samples on first two axes of genetic variation and identify any population outliers. Repeat genome-wide trend tests of association excluding population outliers. Produce QQ plot and calculate genomic control inflation factor. Evidence of structure? PCA of GWA samples (excluding population outliers). Visual inspection of axes of genetic variation and identification of those associated with disease. Repeat genome-wide trend tests of association excluding population outliers, adjusting for axes of genetic variation as covariates. YES
Example: African WTCCC1 Whole genome association study of tuberculosis in the Gambia: part of the WTCCC. Axes of genetic variation calculated using PCA applied to ~100,000 (independent) SNPs genome-wide. Four common ethnic groups separated by first three components of MDS. Inclusion of these components as covariates reduces genomic control statistic from 1.13 (no adjustment) to 1.05 (three components).
Comments Advantages. Multivariate techniques are computationally efficient and can be applied in the context of whole genome association studies. The axes of variation can be interpreted in terms of population structure, and with large numbers of SNPs can clearly differentiate between even relatively “similar” subpopulations and admixed groups. Disadvantages. Some care is needed in interpretation of the eigenvectors (for example may indicate extended regions of LD, rather than population structure).
EMMAX Flexible variance component approach to correct for a wide range of sample structures by explicitly accounting for pair-wise relatedness between individuals, using high-density SNPs. Makes use of a linear mixed model with an empirically estimated relatedness matrix to model the correlation between phenotypes of sample subjects. Can account for “structure” on the scale of ethnic groups, populations from same ancestry group, within populations, and cryptic relatedness.
Example: NFBC66 HEIGHT LDL
Comments EMMAX results are close to uncorrected results when there is minimal evidence of inflation from genomic control. EMMAX results are close to those corrected for principal components as the extent of inflation increases. Has advantage over genomic control of correcting for population structure for each SNP independently. Requires estimation of kinship matrix: many methods available, but may differ in their suitability for trans-ethnic differences or close relationships.
Summary Population structure can lead to spurious associations if disease prevalence and allele frequencies vary between subpopulations. We can use information from markers scattered throughout the genome to test for the presence of structure, identify groups of individuals with similar ancestry, and to correct association tests for mismatching of cases and controls. The genomic control inflation factor can be used as an indicator of the presence of population structure. PCA can be calculate axes of genetic variation that maximise the variability between individuals. Plotting axes of genetic variation from PCA including HapMap samples can be used to identify population outliers. Axes of genetic variation can be used as covariates in the association analysis to adjust for the effects of population structure.