Population structure.



Population structure in case-control studies
Spurious association can arise when the following conditions hold together: the population consists of underlying subpopulations; disease prevalence differs between those subpopulations; and cases are therefore preferentially ascertained from specific subpopulations. False-positive evidence of association will then occur at genetic markers whose genotype frequencies differ between the subpopulations. Traditionally, human geneticists have been sceptical of case-control studies for this reason.
[Figure: panels labelled CASES and CONTROLS.]

Example
The population consists of two equally frequent, isolated subpopulations.
Table (flattened in the transcript; only the surviving values are shown): columns are Subpopulation 1, Subpopulation 2 and Overall; rows are Frequency (0.5, 0.5, 1), Pr(MM) (0.1, 0.9), Pr(Disease), and Pr(Disease & MM) (0.09).
In the population overall: Pr(Disease & MM) < Pr(Disease) Pr(MM).
If we ascertain individuals without regard to subpopulation, cases tend to be selected from subpopulation 1, which has a low frequency of the MM marker genotype.
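The inequality in this example can be checked numerically. The sketch below is a minimal illustration, not the slide's own calculation: the subpopulation frequencies and Pr(MM) values are taken from the example, but the disease prevalences (0.9 and 0.1) are assumed values chosen for illustration, as is the assumption that disease and genotype are independent within each subpopulation.

```python
# Two equally frequent subpopulations (from the example).
weights = [0.5, 0.5]
pr_mm = [0.1, 0.9]          # Pr(MM) in subpopulations 1 and 2 (from the example)
pr_disease = [0.9, 0.1]     # ASSUMED prevalences, chosen for illustration only

# Within each subpopulation, disease and genotype are assumed independent,
# so the joint probability is the product of the two marginals.
joint = sum(w * d * m for w, d, m in zip(weights, pr_disease, pr_mm))
overall_disease = sum(w * d for w, d in zip(weights, pr_disease))
overall_mm = sum(w * m for w, m in zip(weights, pr_mm))

print(f"Pr(Disease & MM)     = {joint:.3f}")
print(f"Pr(Disease) * Pr(MM) = {overall_disease * overall_mm:.3f}")
```

Under these assumptions there is no association inside either subpopulation, yet the pooled population shows a negative association between disease and the MM genotype, which is exactly the spurious signal the example describes.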

Variation across populations
Cancer prevalence per 100,000 individuals:

                             Breast   Lung   Prostate
White Hispanic                   93     34        140
White non-Hispanic              148     65        163
Black                           122     81        272
Asian or Pacific Islander        97     43        100
American Indian                  58     33         54

Matching
One solution to the problem is to allow for structure at the design stage by matching cases and controls for ethnic group: when a case is selected from a given ethnic group, a matched control is selected from the same group. Matched case-control studies require a matched analysis. However, there may be fine-scale structure within ethnic groups, or population admixture, that cannot be accounted for by matching.
Example: apparent association between SNPs and type 2 diabetes in Pima Indians. Type 2 diabetes is less prevalent in Caucasian individuals than in Pima Indians. The association was due to population admixture: cases tended to have a smaller proportion of Caucasian ancestry than controls, and allele frequencies vary between the ancestral populations.

Solutions to the problem
Family data. We can eliminate the problem of population structure by collecting family data. Family-based association designs ascertain affected cases and their parents, and form “internal” controls from the alleles not transmitted from the parents to the child, effectively matching for ancestry. These designs are less powerful, since two parents are required to form a single matched control, and parental data may not always be available, e.g. for late-onset diseases.
Unrelated samples. For unrelated samples of cases and controls, we can use genotype data across the genome to make inferences about, and/or adjust for, population ancestry. In the presence of structure, there will be many more (false) positive signals of association than we would expect by chance.
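For the family-based design, one widely used statistic built on transmitted versus untransmitted alleles is the transmission disequilibrium test (TDT). The slide does not name a specific test, so treating the TDT as the intended method is an assumption. The sketch below computes the McNemar-type statistic from counts of heterozygous parents; the function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted, untransmitted):
    """McNemar-type TDT statistic (approximately chi-squared, 1 df).

    transmitted:   heterozygous parents who passed allele M to the affected child
    untransmitted: heterozygous parents who passed the other allele instead
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / informative

# Hypothetical counts, for illustration only.
stat = tdt_statistic(60, 40)
print(f"TDT statistic = {stat:.2f}")
```

Only heterozygous parents contribute, because a homozygous parent transmits the same allele regardless of disease status. The untransmitted alleles act as the ancestry-matched internal controls.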

Genomic control
Devlin and Roeder (1999) used theoretical arguments to propose that, with population structure, the genome-wide distribution of Cochran-Armitage trend test statistics is inflated by a constant multiplicative factor λ. We can estimate the inflation factor using the statistic λ = median(X_i^2) / 0.456, where X_i^2 is the trend test statistic at SNP i and 0.456 approximates the median of the chi-squared distribution with 1 degree of freedom. An inflation factor λ > 1 indicates population structure and/or genotyping error. We can carry out an adjusted test of association that takes account of any mismatching of cases and controls at any SNP using the statistic X_i^2 / λ.
[Figure: QQ plot annotated “True hits?” and “Population outliers and/or structure?”; inflation factor λ = 1.11.]
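The genomic control calculation can be sketched in a few lines. This is a minimal illustration assuming the trend test statistics are already available as an array; the function name `genomic_control` and the simulated statistics are hypothetical and stand in for real genome-wide results.

```python
import numpy as np

def genomic_control(chi2_stats, median_null=0.456):
    """Return (lambda, adjusted statistics) for 1-df association test statistics."""
    chi2_stats = np.asarray(chi2_stats, dtype=float)
    lam = np.median(chi2_stats) / median_null   # inflation factor
    return lam, chi2_stats / lam                # adjusted statistics X_i^2 / lambda

# Simulated null statistics, artificially inflated by a constant factor.
rng = np.random.default_rng(0)
simulated = rng.chisquare(df=1, size=100_000) * 1.11

lam, adjusted = genomic_control(simulated)
print(f"estimated lambda = {lam:.3f}")
print(f"median after adjustment = {np.median(adjusted):.3f}")
```

Because the median is used, a handful of truly associated SNPs with very large statistics has little effect on the estimate of λ.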

Comments (genomic control)
Advantages. Genomic control is easy to implement in whole genome association studies. It requires relatively small numbers of markers (a minimum of around 50 SNPs). It can be extended to the analysis of quantitative traits and adapted to other genotypic association tests.
Disadvantages. It is limited to relatively simple tests of association and is less well suited to haplotype tests, for example. There will be a loss in power if different genetic effects act in the different subpopulations.

Multivariate techniques
Principal components analysis (PCA) has become a standard tool in genetics to study geographic variation in allele frequencies. PCA is used to infer continuous axes of genetic variation (eigenvectors) that reduce the data to a small number of dimensions, whilst describing as much of the variability between individuals as possible.
We can make use of PCA in GWA studies to:
identify “population outliers” using genotype data available from the HapMap project;
generate axes of genetic variation to account for structure within the study population.
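How PCA produces axes of genetic variation can be sketched with NumPy alone. This is an illustration under stated assumptions, not the procedure of any particular package: genotypes are coded as 0/1/2 allele counts with no missing data, each SNP is centred and scaled by sqrt(2p(1-p)) (one common choice), and the function name and simulated data are hypothetical.

```python
import numpy as np

def genetic_pcs(genotypes, n_axes=2):
    """Scores of each individual on the leading axes of genetic variation.

    genotypes: (n_individuals, n_snps) array of 0/1/2 allele counts, no missing data.
    """
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0                 # sample allele frequency per SNP
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
    G, p = G[:, keep], p[keep]
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))   # centre and scale each SNP
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_axes] * S[:n_axes]

# Simulated data: two subpopulations with independently drawn allele frequencies.
rng = np.random.default_rng(0)
n_snps = 500
freq_a = rng.uniform(0.1, 0.9, n_snps)
freq_b = rng.uniform(0.1, 0.9, n_snps)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(30, n_snps)),
    rng.binomial(2, freq_b, size=(30, n_snps)),
])

pcs = genetic_pcs(genotypes)
print("mean PC1, subpopulation A:", pcs[:30, 0].mean())
print("mean PC1, subpopulation B:", pcs[30:, 0].mean())
```

In this simulation the two groups are strongly differentiated, so the first axis separates them completely. Real subpopulations are usually far more similar and need many more SNPs to resolve.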

Population outliers
The International HapMap Project provides high-density genotype data for three reference populations:
30 CEPH trios from Utah with Northern European ancestry (CEU);
30 Yoruba trios from Ibadan, Nigeria (YRI);
45 unrelated Japanese individuals from Tokyo (JPT) and 45 unrelated Han Chinese individuals from Beijing (CHB).
HapMap samples can be used to define two axes of genetic variation that broadly distinguish populations of European, African and Asian ancestry. Perform PCA on genotype data from the GWA study combined with data from the reference HapMap samples at the same SNPs, then exclude population outliers from the association analysis.

Example: UK WTCCC1
[Figure: QC-filtered samples genotyped at ~400K clean SNPs; labelled groups: Afro-Caribbean samples, South Asian samples.]

Structure within populations
The same PCA techniques can be applied to genotype data from the GWA study alone, without using HapMap samples as a reference. The resulting axes of genetic variation can be used to investigate “finer-scale” structure within the study population.
Are any axes of genetic variation associated with the disease phenotype? If so, they may reflect fine-scale structure confounded with disease, which could inflate genotype-phenotype association statistics.
Axes of genetic variation can be included as covariates within a logistic regression modelling framework to adjust for underlying population structure.
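The effect of including an axis of genetic variation as a covariate can be illustrated on simulated data. The sketch below is not taken from the slides: it fits logistic regression with a small hand-written Newton-Raphson routine (to stay self-contained), and all names, allele frequencies and prevalences are hypothetical. The SNP has no effect on disease; both simply differ between two subpopulations, and `pc1` stands in for an axis that captures that split.

```python
import numpy as np

def logistic_fit(X, y, n_iter=25):
    """Newton-Raphson logistic regression; X must already contain an intercept column.

    Returns coefficient estimates and their standard errors.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        hessian = X.T @ (X * (mu * (1.0 - mu))[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - mu))
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
    cov = np.linalg.inv(X.T @ (X * (mu * (1.0 - mu))[:, None]))
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 2000
group = np.repeat([0, 1], n // 2)                       # hidden subpopulation label
snp = rng.binomial(2, np.where(group == 0, 0.2, 0.8)).astype(float)
disease = rng.binomial(1, np.where(group == 0, 0.2, 0.6)).astype(float)
pc1 = (group - 0.5) + rng.normal(0.0, 0.05, n)          # stand-in for a PCA axis

ones = np.ones(n)
b0, se0 = logistic_fit(np.column_stack([ones, snp]), disease)
b1, se1 = logistic_fit(np.column_stack([ones, snp, pc1]), disease)
z_unadjusted = b0[1] / se0[1]
z_adjusted = b1[1] / se1[1]
print(f"SNP z-score without covariate: {z_unadjusted:.2f}")
print(f"SNP z-score with PC covariate: {z_adjusted:.2f}")
```

Without the covariate, the SNP appears strongly associated with disease purely through confounding. Once the axis is in the model, the SNP's z-score falls back towards what is expected under no association.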

Example: European population structure
[Figure: 1,387 samples, ~200K SNPs. Novembre et al. (2008).]

Software
Standard statistical software, such as R, can be used to perform PCA on genetic data.
Patterson et al. (2006) developed the EIGENSOFT suite of software packages, which use PCA to identify population structure in large-scale data sets with hundreds of thousands of genetic markers and can allow for LD between loci.
The SMARTPCA software can be used to perform PCA and can:
generate any number of axes of genetic variation;
remove outliers on the basis of deviation along axes of genetic variation;
test for association between each axis of genetic variation and disease to determine which may be confounded.
Multi-dimensional scaling (MDS), a related multivariate statistical technique, can also be used to estimate axes of genetic variation in PLINK.

Analysis workflow
1. Perform genome-wide trend tests of association. Produce a QQ plot and calculate the genomic control inflation factor.
2. If there is evidence of structure: perform PCA of the GWA and HapMap samples together. Plot the samples on the first two axes of genetic variation and identify any population outliers.
3. Repeat the genome-wide trend tests of association excluding population outliers. Produce a QQ plot and calculate the genomic control inflation factor.
4. If there is still evidence of structure: perform PCA of the GWA samples alone (excluding population outliers). Inspect the axes of genetic variation visually and identify those associated with disease.
5. Repeat the genome-wide trend tests of association excluding population outliers and adjusting for the axes of genetic variation as covariates.

Example: African WTCCC1
Whole genome association study of tuberculosis in The Gambia, part of the WTCCC. Axes of genetic variation were calculated using PCA applied to ~100,000 (independent) SNPs genome-wide. The four common ethnic groups are separated by the first three components of MDS. Including these components as covariates reduces the genomic control statistic from 1.13 (no adjustment) to 1.05 (three components).

Comments (multivariate techniques)
Advantages. Multivariate techniques are computationally efficient and can be applied in the context of whole genome association studies. The axes of variation can be interpreted in terms of population structure and, with large numbers of SNPs, can clearly differentiate between even relatively “similar” subpopulations and admixed groups.
Disadvantages. Some care is needed in interpreting the eigenvectors: for example, an axis may reflect an extended region of LD rather than population structure.

EMMAX
EMMAX is a flexible variance component approach that corrects for a wide range of sample structures by explicitly accounting for pair-wise relatedness between individuals, estimated from high-density SNPs. It makes use of a linear mixed model with an empirically estimated relatedness matrix to model the correlation between the phenotypes of sample subjects. It can account for “structure” on the scale of ethnic groups, populations from the same ancestry group, structure within populations, and cryptic relatedness.
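What an empirically estimated relatedness matrix looks like can be sketched as follows. This uses a standardised genotype cross-product, which is one common estimator and is an assumption here; it is not necessarily the estimator EMMAX itself uses. The function name and simulated genotypes are hypothetical.

```python
import numpy as np

def relatedness_matrix(genotypes):
    """Pair-wise relatedness from 0/1/2 allele counts (no missing data).

    Each SNP is centred and scaled by sqrt(2p(1-p)); the matrix is the
    average cross-product of these standardised genotypes over SNPs.
    """
    G = np.asarray(genotypes, dtype=float)
    p = G.mean(axis=0) / 2.0
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs
    G, p = G[:, keep], p[keep]
    Z = (G - 2 * p) / np.sqrt(2 * p * (1 - p))
    return Z @ Z.T / Z.shape[1]

# Simulated unrelated individuals, plus a duplicate of the first individual.
rng = np.random.default_rng(2)
freqs = rng.uniform(0.1, 0.9, 1000)
G = rng.binomial(2, freqs, size=(10, 1000))
G = np.vstack([G, G[0]])

K = relatedness_matrix(G)
print("duplicate pair:", round(K[0, 10], 3))
print("unrelated pair:", round(K[1, 2], 3))
```

In a linear mixed model, a matrix of this kind specifies the covariance of the random effect, so that individuals with more similar genotypes are modelled as having more strongly correlated phenotypes.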

Example: NFBC66
[Figure: panels labelled HEIGHT and LDL.]

Comments (EMMAX)
EMMAX results are close to uncorrected results when genomic control shows minimal evidence of inflation, and move closer to results corrected for principal components as the extent of inflation increases. EMMAX has the advantage over genomic control of correcting for population structure at each SNP independently. It requires estimation of a kinship matrix: many methods are available, but they may differ in their suitability for trans-ethnic differences or close relationships.

Summary
Population structure can lead to spurious associations if disease prevalence and allele frequencies vary between subpopulations.
We can use information from markers scattered throughout the genome to test for the presence of structure, to identify groups of individuals with similar ancestry, and to correct association tests for mismatching of cases and controls.
The genomic control inflation factor can be used as an indicator of the presence of population structure.
PCA can be used to calculate axes of genetic variation that maximise the variability between individuals.
Plotting axes of genetic variation from a PCA that includes HapMap samples can be used to identify population outliers.
Axes of genetic variation can be used as covariates in the association analysis to adjust for the effects of population structure.