Practical Considerations in Statistical Genetics Ashley Beecham June 19, 2015.


Considerations
- Study Design
- Quality control: pre-analysis
  - Samples
  - Genetic markers
- Quality control: post-analysis
  - Q-Q plots
- Quality control: meta-analysis
- Multiple Testing

Study Design
- Is your phenotype genetic (i.e., heritable)?
- Is it a binary trait or a quantitative one?
- Are there age differences? Gender differences?
- Are there important environmental factors to consider?

Sample Quality Control
- Genotyping efficiency
- Gender discrepancies
- Relatedness
- Population stratification (case-control studies)
- Mendelian errors (families)

Sample Quality Control (Gender Checks)
[Figure: gender-check plot with regions labeled "Sample Mix-up or Mislabel" (two regions) and "Possible Sample Contamination"]

Sample Quality Control (Relatedness)
- Calculate the identity-by-state (IBS) mean between pairs of samples and plot the standardized mean and variance using Graphical Representation of Relationships (GRR; Abecasis et al., Bioinformatics 2001)
[Figure: example plots; panels: Unrelated Case-Control, Trios]
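The pairwise IBS summary described on this slide can be sketched in a few lines. This is a minimal illustration, not the GRR implementation: it assumes genotypes coded as 0/1/2 counts of one allele, uses the raw (unstandardized) mean and variance, and the sample IDs and data are hypothetical.

```python
from itertools import combinations
from statistics import mean, pvariance


def ibs_sharing(g1, g2):
    """Alleles shared identical by state (0, 1, or 2) at one marker.

    Genotypes are coded as counts of one allele (0, 1, 2).
    """
    return 2 - abs(g1 - g2)


def ibs_summary(genotypes):
    """Return {(id_a, id_b): (mean IBS, variance of IBS)} for every pair.

    `genotypes` maps a sample ID to a list of 0/1/2 codes, with None for a
    missing call; markers missing in either sample are skipped.
    """
    summary = {}
    for a, b in combinations(sorted(genotypes), 2):
        shared = [ibs_sharing(x, y)
                  for x, y in zip(genotypes[a], genotypes[b])
                  if x is not None and y is not None]
        summary[(a, b)] = (mean(shared), pvariance(shared))
    return summary


# Toy data (hypothetical): s1 and s2 are duplicates, s3 is a different person.
toy = {
    "s1": [0, 1, 2, 1, 0, 2],
    "s2": [0, 1, 2, 1, 0, 2],
    "s3": [2, 1, 0, 0, 1, 2],
}
summary = ibs_summary(toy)
for pair, (m, v) in summary.items():
    print(pair, round(m, 3), round(v, 3))
```

Duplicate samples sit at a mean of 2 with zero variance, while less related pairs have a lower mean and higher variance; plotting mean against variance is what separates the relationship classes.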

Sample Quality Control (Population Stratification)
- Allele frequency and prevalence differences between groups
  - Genetic drift
  - Differential selection
  - Little migration between subpopulations

Sample Quality Control (Population Stratification)
- EIGENSTRAT (Price et al., Nature Genetics 2006)
  - Principal components analysis (PCA) method
    - Applies PCA to genotype data to infer population substructure
  - Principal components can be used as covariates in a regression model to correct for bias caused by substructure
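As a rough sketch of the PCA step (not the EIGENSTRAT software itself), principal components can be computed from a standardized genotype matrix with a singular value decomposition. The simulated data, the function name `genotype_pcs`, and the per-marker scaling by sqrt(2p(1-p)) are assumptions made for illustration.

```python
import numpy as np


def genotype_pcs(genotypes, n_components=2):
    """Top principal components of a samples x markers 0/1/2 genotype matrix.

    Columns are centred at twice the allele frequency and scaled by
    sqrt(2p(1-p)); monomorphic markers are dropped first.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # per-marker allele frequency
    keep = (p > 0) & (p < 1)                 # drop monomorphic markers
    g, p = g[:, keep], p[keep]
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    # SVD of the standardized matrix; u * s gives the sample-level PC scores.
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Simulated data (hypothetical): two subpopulations whose allele
# frequencies differ at 200 markers, 30 samples each.
rng = np.random.default_rng(0)
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0, 0.2, size=200), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(30, 200))
pop_b = rng.binomial(2, freq_b, size=(30, 200))
pcs = genotype_pcs(np.vstack([pop_a, pop_b]))

# The group means of PC1 fall on opposite sides of zero.
print(pcs[:30, 0].mean(), pcs[30:, 0].mean())
```

The resulting columns of `pcs` would then enter the association model as covariates, for example logit(P(case)) = b0 + b1 * genotype + b2 * PC1 + b3 * PC2.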

Quality Control of Genetic Markers
- Genotyping efficiency
- Hardy-Weinberg equilibrium
- Differential missingness

Marker Quality Control: Hardy-Weinberg Equilibrium
- There are two alleles at a given locus, A and a
- $p = \mathrm{freq}(A)$ and $q = \mathrm{freq}(a)$
- $p + q = 1$

Marker Quality Control: Hardy-Weinberg Equilibrium
$(p + q)(p + q) = p^2 + pq + qp + q^2 = p^2 + 2pq + q^2$
- $p^2$: AA homozygotes
- $2pq$: Aa heterozygotes
- $q^2$: aa homozygotes

Marker Quality Control: Hardy-Weinberg Equilibrium
- $p^2 = f(AA)$
- $2pq = f(Aa)$
- $q^2 = f(aa)$

Marker Quality Control: Hardy-Weinberg Equilibrium
- Under a dominant model (taking A as the disease allele)
  - Frequency of affecteds = $p^2 + 2pq$
- Under a recessive model (taking a as the disease allele)
  - Frequency of affecteds = $q^2$
  - Frequency of carriers = $2pq$
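A short worked example of the recessive case may help. It assumes HWE, full penetrance of the aa genotype, and a hypothetical prevalence of 1 in 10,000; the function name is illustrative.

```python
import math


def recessive_frequencies(prevalence):
    """Allele and genotype frequencies implied by a recessive disease
    prevalence, assuming HWE and full penetrance of the aa genotype."""
    q = math.sqrt(prevalence)      # q^2 = f(aa) = prevalence
    p = 1.0 - q                    # p + q = 1
    return {"p": p, "q": q, "carriers": 2 * p * q, "affected": q * q}


freqs = recessive_frequencies(1 / 10_000)   # hypothetical prevalence
print(freqs)
```

With this prevalence, q is 0.01 and the carrier frequency 2pq is about 0.0198, so carriers are roughly 200 times more common than affected individuals.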

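Marker Quality Control: Hardy-Weinberg Equilibrium
- Deviation from HWE is assessed with a simple χ² test
- Deviation can indicate laboratory error
- Deviation may also be telling you something
  - Controls in HWE, cases not: this pattern can reflect a real association at the marker rather than a genotyping problem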
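The χ² test for HWE can be sketched as follows. This is a minimal version with one degree of freedom and no continuity correction; exact tests are often preferred when genotype counts are small. The genotype counts are hypothetical.

```python
import math


def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test (1 df) of Hardy-Weinberg equilibrium.

    Takes observed genotype counts and returns (chi2, p_value).
    """
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


in_hwe = hwe_chi_square(25, 50, 25)      # counts match p^2, 2pq, q^2 exactly
het_deficit = hwe_chi_square(40, 20, 40) # far too few heterozygotes
print(in_hwe, het_deficit)
```

In a QC pipeline the test is usually run in controls only, and markers below a chosen p-value threshold are flagged rather than silently dropped.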

Post Analysis Quality Control: Q-Q plots
What is a Q-Q plot?
- “Q” stands for quantile
- Used to assess the number and magnitude of observed associations between SNPs and the trait of interest, compared to the association statistics expected under the null hypothesis of no association
  - Deviations from the “identity” line:
    - May reflect true association
    - Sharp deviations are likely due to error
    - Also possible due to sample relatedness or population structure
- Genomic Inflation Factor (GIF) can be computed to assess deviations
  - Ratio of the median observed association statistic to the expected median
  - A value of 1 would mean no deviation
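Both quantities on this slide are simple to compute. The sketch below assumes the association statistics are χ² values with 1 degree of freedom (whose null median is about 0.4549) and that all p-values are strictly greater than zero; the function names are illustrative.

```python
import math
from statistics import median

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation_factor(chi2_stats):
    """Ratio of the observed median test statistic to the null median."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def qq_points(p_values):
    """(expected, observed) -log10(p) pairs for a Q-Q plot.

    Under the null, the i-th smallest of n p-values is expected near
    (i - 0.5) / n.
    """
    n = len(p_values)
    observed = sorted(-math.log10(p) for p in p_values)
    expected = sorted(-math.log10((i - 0.5) / n) for i in range(1, n + 1))
    return list(zip(expected, observed))


# Hypothetical statistics whose median equals the null median: GIF is 1.
null_like = [0.1, 0.3, CHI2_1DF_MEDIAN, 1.2, 3.0]
print(genomic_inflation_factor(null_like))

# Perfectly uniform p-values fall exactly on the identity line.
uniform_p = [(i - 0.5) / 5 for i in range(1, 6)]
print(qq_points(uniform_p))
```

Plotting the pairs from `qq_points` against the line y = x gives the Q-Q plot; a GIF well above 1 suggests inflation that should be investigated before interpreting any hits.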

Post Analysis Quality Control: Q-Q plots

Meta-Analyses
There can be biases in our data not only within sites but across sites!
- Genotyping effects
- Genotype calling effects

Batch Effects: A Tale of the ImmunoChip
ImmunoChip: Fine-Mapping, Replication (207,728)
- AS (Ankylosing Spondylitis)
- CeD (Coeliac Disease)
- CD (Crohn’s Disease)
- IgA (IgA Deficiency)
- MS (Multiple Sclerosis)
- PBC (Primary Biliary Cirrhosis)
- PS (Psoriasis)
- RA (Rheumatoid Arthritis)
- SLE (Systemic Lupus Erythematosus)
- T1D (Type 1 Diabetes)
- UC (Ulcerative Colitis)
- AITD (Autoimmune Thyroid Disease)
- WTCCC2 (PD, Bipolar, Reading etc.)

A Focus on Multiple Sclerosis
Cases and controls by stratum
- Strata: AUSNZ, Belgium, Denmark, Finland, France, Germany, Italy, Norway, Sweden, UK, US
- Total: 14,498 cases, 24,091 controls

Genotyping and Genotype Calling
- Genotyping was done at 5 sites:
  - John P. Hussman Institute for Human Genomics, University of Miami
  - Wellcome Trust Sanger Institute
  - Local sites in France, Germany, and the United States
- All genotype calling was done at the Wellcome Trust Sanger Institute in 3 batches
  - Initially used Illuminus and GenoSNP
  - Final genotype calls made with Opticall

Initial Marker Quality Control
Using Illuminus and GenoSNP, autosomal markers were divided into categories of ‘good’, ‘middle’, and ‘bad’ based on the following criteria:
- Good: call rate was ≥95% in both callers and concordance between them was ≥99%
  - Concordant calls were kept
- Bad: call rate was <95% in both Illuminus and GenoSNP
  - These markers were dropped
- Middle: marker did not meet the Good or Bad criteria
  - More detailed analysis was done using 1000 Genomes data
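The three-way rule above translates directly into a small function. The thresholds come from the slide; the function and argument names are hypothetical, and call rates and concordance are assumed to be expressed as proportions between 0 and 1.

```python
def classify_marker(call_rate_illuminus, call_rate_genosnp, concordance):
    """Assign a marker to 'good', 'middle', or 'bad' from the call rates of
    the two callers and the concordance between their calls."""
    if call_rate_illuminus < 0.95 and call_rate_genosnp < 0.95:
        return "bad"        # fails in both callers
    if (call_rate_illuminus >= 0.95 and call_rate_genosnp >= 0.95
            and concordance >= 0.99):
        return "good"       # passes in both callers and the calls agree
    return "middle"         # everything else needs a closer look


# Hypothetical markers: (Illuminus call rate, GenoSNP call rate, concordance)
examples = {
    "marker_1": (0.99, 0.98, 0.999),
    "marker_2": (0.90, 0.93, 0.999),
    "marker_3": (0.99, 0.92, 0.999),
    "marker_4": (0.99, 0.99, 0.970),
}
labels = {name: classify_marker(*values) for name, values in examples.items()}
print(labels)
```

Checking the "bad" rule first keeps the two explicit criteria separate and leaves "middle" as the catch-all, which matches how the slide defines it.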

Initial Test for Population Substructure
- While testing for population substructure, problems related to ‘calling batches’ were discovered.
- PCA was run on a test set of Swedish samples.
[Figure: PCA plot with samples labeled by genotyping center: Miami, Sanger]

Investigating the Problem
We performed the following comparisons to identify the source of the problem:
- (A) Define the genotyping center as the phenotype and regress it on each variant.
- (B) Test for differences in genotype missingness between the 2 centers.
- (C) Test for deviation from Hardy-Weinberg equilibrium.
[Figure: three scatter plots of the first principal component’s loadings (y axis) against -log10(p-values) from (A) a logistic regression model using the genotyping center as phenotype, (B) a test of SNP missingness between the 2 genotyping centers, and (C) a test for deviation from Hardy-Weinberg equilibrium]

Investigating the Problem
[Figure: principal components recalculated after each filter; panels: Genotyping center as phenotype, SNP missingness between centers, HWE]
- In the next step, we identified all the SNPs with a p-value < $10^{-3}$ in each respective test, removed them, and recalculated the principal components.
- From the above, it is clear that the difference between genotyping centers is not the culprit; rather, the problem seems to be associated with differences in HWE, which are a proxy for discordant calls between centers.

Investigating the Problem
Example: rs
For this SNP, the Illuminus call was used for both centers. In Miami, a G allele was assigned and in Sanger an A allele was assigned. This means that the cluster assignment was likely reversed between sites.

Data               A1   A2   Genotype counts (A1A1/A1A2/A2A2)
All                G    A    1969/0/6866
Miami Illuminus    0    G    0/0/1969
Sanger Illuminus   0    A    0/0/6866
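One way to screen for this pattern is to flag markers that are monomorphic within every calling batch but fixed for different alleles across batches. The sketch below uses the genotype counts from this slide; the function names and the data layout are assumptions for illustration, not part of the original QC pipeline.

```python
def monomorphic_allele(a1, a2, counts):
    """Return the only allele seen in a batch, or None if it is polymorphic.

    `counts` is (A1A1, A1A2, A2A2), matching the layout of the table above.
    """
    n11, n12, n22 = counts
    if n12 == 0 and n11 == 0 and n22 > 0:
        return a2
    if n12 == 0 and n22 == 0 and n11 > 0:
        return a1
    return None


def discordant_monomorphic(batches):
    """True if every batch is monomorphic but the batches disagree on the
    allele, which suggests a reversed cluster assignment."""
    alleles = {monomorphic_allele(*row) for row in batches.values()}
    return None not in alleles and len(alleles) > 1


# Genotype counts from the slide: (A1, A2, (A1A1, A1A2, A2A2)).
slide_snp = {
    "Miami Illuminus": ("0", "G", (0, 0, 1969)),
    "Sanger Illuminus": ("0", "A", (0, 0, 6866)),
}
flagged = discordant_monomorphic(slide_snp)
print(flagged)
```

When such batches are pooled, the marker looks polymorphic with no heterozygotes at all, which is why it surfaces as an extreme HWE deviation.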

Investigating the Problem
[Figure: panels GenoSNP and Illuminus]
- Illuminus fails to call the same allele even for some mono-allelic markers

Solution to the Problem
- The dichotomy of the first principal component is explained by calling discordances of the Illuminus caller.
- There is probably a bug in the Illuminus calling algorithm that makes calls difficult when fewer than 3 clusters exist.
- Solution: re-QC using GenoSNP or Opticall (new)

Solution to the Problem
[Figure: PCA plots; panels: Clean GenoSNP/Illuminus, Opticall]
- Using Opticall, the first principal component no longer splits the data into 2 separate clusters.
- In later analyses, Opticall was determined to have less variation than GenoSNP in genotype frequencies between genotype calling batches.

Final Assessment of Analysis: GIF
[Figure: marker counts by category (production, Failed QC, Monomorphic, MAF > 5%, MAF 0.5-5%, MAF < 0.5%; Autosomal). Counts as extracted, in order: 207, , ,311; 24,388; 20,381; 10,710; 28,406; 108,517]

Multiple Testing
In genetics, there have always been two opposing camps:
- Liberals: They don’t worry about it at all. They report nominal P values and aren’t afraid to be wrong.
- Conservatives: They worry about it all the time. They report only fully “corrected” P values.
Common methods:
- Bonferroni
- False Discovery Rate
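The two common methods can be compared on a small set of p-values. The sketch below uses the Benjamini-Hochberg step-up procedure as one standard way to control the false discovery rate (the slide does not name a specific FDR method), and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject tests whose p-value is at most alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha.

    Finds the largest rank k with p_(k) <= (k / m) * alpha and rejects the
    k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values from 8 tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
bonf = bonferroni(p_vals)
bh = benjamini_hochberg(p_vals)
print(sum(bonf), sum(bh))
```

On these values Bonferroni keeps one result and Benjamini-Hochberg keeps two, which illustrates the usual trade-off: FDR control is less stringent and retains more findings at the cost of tolerating some false positives among them.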