Practical Considerations in Statistical Genetics Ashley Beecham June 19, 2015.

1 Practical Considerations in Statistical Genetics Ashley Beecham June 19, 2015

2 Considerations Study Design Quality control: pre-analysis  Samples  Genetic markers Quality control: post-analysis  Q-Q plots Quality Control: meta-analysis Multiple Testing

3 Study Design Is your phenotype genetic (i.e. heritable)? Is it a binary trait? Or quantitative? Are there age differences? Gender differences? Are there important environmental factors to consider?

4 Sample Quality Control Genotyping efficiency Gender discrepancies Relatedness Population stratification (case-control studies) Mendelian errors (families)

5 Sample Quality Control (Gender Checks) [Plot annotations: Sample Mix-up or Mislabel; Possible Sample Contamination; Sample Mix-up or Mislabel]

6 Sample Quality Control (Relatedness) Calculate the mean Identity by State (IBS) between pairs of samples and plot the standardized mean and variance using Graphical Relationship Representation (Abecasis et al., Bioinformatics 2001). [Plot panels: Unrelated Case-Control; Trios]
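The pairwise IBS summary described on this slide can be sketched as follows. This is a minimal illustration, not the Graphical Relationship Representation software itself: the function name `ibs_summary` and the 0/1/2 allele-count coding (with `None` for a missing call) are assumptions made for the example.

```python
from statistics import mean, pvariance


def ibs_summary(g1, g2):
    """Return (mean, variance) of IBS sharing across markers for one pair.

    g1, g2: per-marker genotypes coded as allele counts 0, 1 or 2
    (hypothetical coding for this sketch); None marks a missing call.
    """
    # At each marker the pair shares 2 - |g1 - g2| alleles identical by state.
    shared = [
        2 - abs(a - b)
        for a, b in zip(g1, g2)
        if a is not None and b is not None
    ]
    return mean(shared), pvariance(shared)


# Identical genotypes (duplicate sample or monozygotic twins) share 2 alleles
# at every marker, so the variance is 0.
duplicate_pair = ibs_summary([0, 1, 2, 1, 0, 2], [0, 1, 2, 1, 0, 2])
# A less similar pair has a lower mean and a larger variance.
other_pair = ibs_summary([0, 1, 2, 1, 0, 2], [2, 1, 0, 1, 1, 2])
```

Plotting the mean against the variance for every pair is what lets duplicates, parent-offspring pairs, siblings and unrelated pairs fall into separate clusters.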

7 Sample Quality Control (Population Stratification) Allele frequency and prevalence differences between groups  Genetic drift  Differential selection  Little migration between subpopulations

8 Sample Quality Control (Population Stratification) EIGENSTRAT (Price et al., Nature Genetics 2006)  Principal Components Analysis (PCA) method ► Applies principal components analysis to genotype data to infer population substructure  Principal components can be used as covariates in a regression model to correct for bias caused by substructure
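To make the PCA step concrete, here is a simplified sketch of extracting principal components from a genotype matrix. It is not the EIGENSTRAT program: the function name `top_pcs`, the 0/1/2 coding, and the plain sample allele frequency used for scaling are assumptions of this example, and it requires NumPy.

```python
import numpy as np


def top_pcs(genotypes, n_pcs=2):
    """Return the first n_pcs principal component scores per sample.

    genotypes: samples x markers array of allele counts (0, 1, 2),
    assumed complete (no missing calls) for this sketch.
    """
    g = np.asarray(genotypes, dtype=float)
    freq = g.mean(axis=0) / 2.0
    # Monomorphic markers carry no information and would divide by zero.
    keep = (freq > 0) & (freq < 1)
    g, freq = g[:, keep], freq[keep]
    # Center each marker and scale by its expected binomial spread.
    z = (g - 2 * freq) / np.sqrt(freq * (1 - freq))
    # Left singular vectors times singular values give the sample scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]
```

The returned columns are what would be added as covariates to the association regression model.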

9 Quality Control of Genetic Markers Genotyping efficiency Hardy-Weinberg equilibrium Differential missingness

10 Marker Quality Control: Hardy Weinberg Equilibrium There are two alleles at a given locus, A and a p=freq(A) and q=freq(a) p + q = 1

11 Marker Quality Control: Hardy Weinberg Equilibrium $(p + q)(p + q) = p^2 + pq + qp + q^2 = p^2 + 2pq + q^2$, where $p^2$ is the frequency of AA homozygotes, $2pq$ of Aa heterozygotes, and $q^2$ of aa homozygotes

12 Marker Quality Control: Hardy Weinberg Equilibrium $p^2 = f(AA)$, $2pq = f(Aa)$, $q^2 = f(aa)$

13 Marker Quality Control: Hardy Weinberg Equilibrium Under a dominant model  Frequency of affecteds = $p^2 + 2pq$ Under a recessive model  Frequency of affecteds = $q^2$  Frequency of carriers = $2pq$
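As a worked example of the recessive case, the carrier frequency can be recovered from the frequency of affecteds. The function name and the input value are illustrative assumptions, and the calculation assumes the locus is in Hardy-Weinberg equilibrium.

```python
from math import sqrt


def recessive_carrier_frequency(affected_freq):
    """Carrier frequency 2pq implied by a recessive trait frequency q^2."""
    q = sqrt(affected_freq)  # frequency of the recessive allele a
    p = 1 - q                # frequency of the other allele A
    return 2 * p * q


# Illustrative input: if 1 in 10,000 people are affected, then q = 0.01
# and 2pq = 2 * 0.99 * 0.01, so carriers are far more common than affecteds.
carriers = recessive_carrier_frequency(1 / 10_000)
```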

14 Marker Quality Control: Hardy Weinberg Equilibrium Deviation from HWE can be assessed with a simple $\chi^2$ test. A deviation may reflect laboratory error, or it may be telling you something  Controls in HWE, Cases not
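The simple chi-square test mentioned on this slide can be sketched as below. This is a minimal goodness-of-fit version with 1 degree of freedom; the function name `hwe_chisq` is an assumption, and exact tests are often preferred in practice when genotype counts are small.

```python
from math import erfc, sqrt


def hwe_chisq(n_hom_a, n_het, n_hom_b):
    """Chi-square test (1 d.f.) for deviation from Hardy-Weinberg equilibrium.

    Arguments are observed genotype counts for AA, Aa and aa.
    Returns (chi-square statistic, p-value).
    """
    n = n_hom_a + n_het + n_hom_b
    p = (2 * n_hom_a + n_het) / (2 * n)  # estimated frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        # Monomorphic marker: the expected counts match trivially.
        return 0.0, 1.0
    observed = (n_hom_a, n_het, n_hom_b)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chisq = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = erfc(sqrt(chisq / 2))
    return chisq, p_value


in_hwe = hwe_chisq(25, 50, 25)     # counts exactly at p^2, 2pq, q^2
no_hets = hwe_chisq(50, 0, 50)     # complete absence of heterozygotes
```

Running the test separately in cases and controls is what makes the "controls in HWE, cases not" pattern visible.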

15 Post Analysis Quality Control: Q-Q plots What is a Q-Q Plot?  “Q” stands for quantile  Used to assess the number and magnitude of observed associations between SNPs and the trait of interest, compared to the association statistics expected under the null hypothesis of no association ► Deviations from the “identity” line: may reflect true association; sharp deviations are likely due to error; also possible due to sample relatedness or population structure  Genomic Inflation Factor (GIF) can be computed to assess deviations ► Ratio of the median observed association statistic to the expected median ► A value of 1 would mean no deviation
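The GIF definition above can be sketched in a few lines. This example assumes the association statistics are 1-degree-of-freedom chi-square tests reported as p-values strictly between 0 and 1 (or equal to 1); the function name is hypothetical.

```python
from statistics import NormalDist, median

_NORM = NormalDist()
# Median of a chi-square distribution with 1 d.f. (about 0.455).
EXPECTED_MEDIAN_CHISQ_1DF = _NORM.inv_cdf(0.75) ** 2


def genomic_inflation_factor(p_values):
    """Ratio of the median observed chi-square statistic to its null median."""
    # A 1 d.f. chi-square statistic is a squared standard normal deviate,
    # so each p-value maps back to a statistic through the normal quantile.
    chisq = [_NORM.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chisq) / EXPECTED_MEDIAN_CHISQ_1DF


# Evenly spread p-values mimic the null hypothesis and give a GIF close to 1.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
gif_null = genomic_inflation_factor(null_like)
# Systematically smaller p-values inflate the statistic.
gif_inflated = genomic_inflation_factor([p ** 2 for p in null_like])
```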

16 Post Analysis Quality Control: Q-Q plots

17 Meta-Analyses There can be biases in our data not only within sites but across sites!  Genotyping effects  Genotype calling effects

18 Batch Effects: A Tale of the ImmunoChip ImmunoChip Fine-Mapping Replication 207,728 AS (Ankylosing Spondylitis) CeD (Coeliac Disease) CD (Crohn’s Disease) IgA (IgA Deficiency) MS (Multiple Sclerosis) PBC (Primary Biliary Cirrhosis) PS (Psoriasis) RA (Rheumatoid Arthritis) SLE (Systemic Lupus Erythematosus) T1D (Type 1 Diabetes) UC (Ulcerative Colitis) AITD (Autoimmune Thyroid Disease) WTCCC2 (PD, Bipolar, Reading etc.)

19 A Focus on Multiple Sclerosis

Stratum    Cases     Controls
AUSNZ      247       944
Belgium    302       1703
Denmark    741       835
Finland    221       486
France     386       354
Germany    2582      5545
Italy      957       1255
Norway     894       674
Sweden     2153      2331
UK         4324      4422
US         1691      5542
TOTAL      14,498    24,091

20 Genotyping and Genotype Calling Genotyping was done at 5 sites:  John P. Hussman Institute for Human Genomics, University of Miami  Wellcome Trust Sanger Institute  Local sites in France, Germany, and the United States All genotype calling was done at the Wellcome Trust Sanger Institute in 3 batches  Initially used Illuminus and GenoSNP  Final genotype calls made with Opticall

21 Initial Marker Quality Control Using Illuminus and GenoSNP, autosomal markers were divided into categories of ‘good’, ‘middle’, and ‘bad’ based on the following criteria:  Good: call rate in both was ≥95% and concordance was ≥99% ► Concordant calls were kept  Bad: call rate was <95% in both Illuminus and GenoSNP ► Drop all markers  Middle: marker did not meet Good or Bad criteria ► More detailed analysis was done using 1000 genomes data
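The three categories above amount to a simple decision rule, sketched here. The function and argument names are hypothetical, and call rates and concordance are assumed to be expressed as proportions between 0 and 1.

```python
def classify_marker(call_rate_illuminus, call_rate_genosnp, concordance):
    """Assign a marker to 'good', 'bad' or 'middle' using the slide's thresholds."""
    both_high = call_rate_illuminus >= 0.95 and call_rate_genosnp >= 0.95
    both_low = call_rate_illuminus < 0.95 and call_rate_genosnp < 0.95
    if both_high and concordance >= 0.99:
        return "good"    # concordant calls are kept
    if both_low:
        return "bad"     # marker is dropped
    return "middle"      # needs more detailed follow-up


labels = [
    classify_marker(0.99, 0.98, 0.995),  # high call rates, concordant
    classify_marker(0.90, 0.80, 0.999),  # low call rate in both callers
    classify_marker(0.99, 0.90, 0.999),  # callers disagree on call rate
    classify_marker(0.99, 0.98, 0.950),  # high call rates, poor concordance
]
```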

22 Initial Test for Population Substructure While testing for population substructure, problems related to ‘calling batches’ were discovered. PCA was done using a test set of Swedish samples. [Plot legend: Miami; Sanger]

23 Investigating the Problem We performed the following comparisons to identify the source of the problem:  Define the genotyping center as phenotype and regress it on the variants. (A)  Test for differences in genotype missingness between the 2 centers. (B)  Test for deviation from Hardy-Weinberg equilibrium. (C) [Plot panels: (A) scatter plot of the first principal component’s loadings (y axis) vs -log10(p-values) from a logistic regression model using the genotypic center as phenotype; (B) scatter plot of the first principal component’s loadings (y axis) vs -log10(p-values) from a test of SNP missingness between the 2 genotypic centers; (C) scatter plot of the first principal component’s loadings (y axis) vs -log10(p-values) for deviation from Hardy-Weinberg equilibrium]

24 Investigating the Problem [Plot panels: Genotypic center as phenotype; SNP missingness between centers; HWE] In the next step, we identified all the SNPs with a p-value < 10^-3 in each respective test, removed them, and then calculated the new principal components. From the above, it is clear that the difference between genotypic centers is not the culprit; rather, the problem seems to be associated with differences in HWE, which are a proxy for discordant calls between centers.

25 Investigating the Problem Example: rs13306196 For this SNP, the Illuminus call was used for both centers. In Miami, a G allele was assigned and in Sanger an A allele was assigned. This means that the cluster assignment was likely reversed between sites.

Data                A1    A2    Genotype Counts (A1A1/A1A2/A2A2)
All                 G     A     1969/0/6866
Miami Illuminus     0     G     0/0/1969
Sanger Illuminus    0     A     0/0/6866

26 Investigating the Problem [Plot panels: GenoSNP; Illuminus] Illuminus fails to call the same allele even for some mono-allelic markers

27 Solution to the Problem The dichotomy of the first principal component is explained by calling discordances of the Illuminus caller. A bug probably exists in the Illuminus calling algorithm that causes difficulties in making calls when fewer than 3 clusters exist. Solution: Re-QC using GenoSNP or Opticall (new)

28 Solution to the Problem [Plot panels: Clean GenoSNP/Illuminus; Opticall] Using Opticall, the first principal component no longer splits the data into 2 separate clusters. In later analyses, Opticall was determined to have less variation than GenoSNP in genotype frequencies between genotype calling batches.

29 Final Assessment of Analysis: GIF [Flow chart of autosomal marker counts; values and labels as extracted from the figure: 207,728; 192,402; 161,311; 24,388; production; Failed QC; 20,381; 10,710; Monomorphic; MAF > 5%; 28,406; MAF 0.5-5%; 108,517; MAF < 0.5%]

30 Multiple Testing In genetics, there have always been two opposing camps:  Liberals: They don’t worry about it at all. They report nominal P values and aren’t afraid to be wrong.  Conservatives: They worry about it all the time. They report only fully “corrected” P values. Common methods:  Bonferroni  False Discovery Rate
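The two common methods named on this slide can be sketched as p-value adjustments. This is a minimal illustration with hypothetical function names; the false discovery rate version shown is the Benjamini-Hochberg step-up procedure, which is one of several FDR methods and may not be the one the presenter had in mind.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


nominal = [0.01, 0.04, 0.03, 0.005]
bonf = bonferroni(nominal)          # stricter: controls family-wise error
fdr = benjamini_hochberg(nominal)   # less strict: controls false discoveries
```

With these four illustrative p-values, all four pass a 0.05 threshold after the FDR adjustment, while only two do after Bonferroni, which shows the gap between the "liberal" and "conservative" ends of the spectrum.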

