Genome-Wide Association Studies (GWAS)


Genome-Wide Association Studies (GWAS)
Study design: case/control, family-based, cohort
Phenotype: dichotomous, quantitative
Scale: 10^3 to 10^5 individuals; 10^5 to 10^6 polymorphisms

Genotyping
[Figure: genotype cluster plot shown in Cartesian coordinate view and polar coordinate view]

Genotyping
[Figure: example of a poor cluster plot]

PLINK
What is PLINK?
– Software for analysing phenotype/genotype data
– Run from the command line
Why should we use PLINK?
– Probably the most common tool used to analyse (human) GWAS data
– Free and open source
– Designed to perform a wide range of basic, large-scale analyses in a computationally efficient manner
– Can be used on several platforms
– No programming ability required, and the documentation is excellent

The Original PLINK

PLINK 1.9: Speedier (but less informative)

How to get PLINK (for Windows)
Determine whether your PC is 32-bit or 64-bit
Download the relevant stable build from the PLINK 1.9 website to a convenient location, then unzip the PLINK program/executable

Running PLINK (on Windows)
PLINK is run from the command prompt
Navigate to the location of the data or the PLINK executable using the cd command (cd = change directory)

Note
The command prompt needs to be told where the PLINK executable and the data are
It is easiest to point the command prompt at the folder/directory that holds your data
If the PLINK executable is in that folder, nothing more is needed
If not, add PLINK's location to your PATH environment variable to make it easier to call:
> echo %PATH%
> path = C:\PLINK_location;%PATH%
This change is temporary and only applies to the current command prompt window

File Formats
filename.ped    text-based pedigree file
filename.map    text-based map file
filename.bed    binary genotype file
filename.bim    marker file
filename.fam    family/individual file
PHENO file      phenotype file
COVAR file      covariate file

PED Files: the individual & genotype file

MAP file: the marker information file Genetic distance

The PED and MAP file
The Ped(igree) Files
1. Family ID
2. Individual ID
3. Paternal ID
4. Maternal ID
5. Sex (1 = male; 2 = female; other = unknown)
6. Phenotype (pheno)
The Map Files
1. Chromosome (1-22; X or 23; Y or 24; PAR or 25; MT or 26; 0 = unplaced)
2. rs# or SNP identifier
3. Genetic distance (morgans)
4. Base-pair position (bp units)
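The column layout above can be sketched as a small parser. This is an illustrative Python sketch, not part of PLINK: the class and function names are hypothetical, and it assumes whitespace-delimited files in which the PED genotype columns (column 7 onwards) hold two alleles per marker.

```python
from typing import NamedTuple

class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str          # "1" = male, "2" = female, anything else = unknown
    phenotype: str
    genotypes: list   # one (allele1, allele2) tuple per marker

class MapRecord(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: float  # morgans
    bp_position: int         # base pairs

def parse_ped_line(line):
    """Split one PED line into the six leading columns plus allele pairs."""
    fields = line.split()
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("PED genotype columns must come in allele pairs")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)

def parse_map_line(line):
    """Split one MAP line into its four columns."""
    chrom, snp_id, dist, pos = line.split()
    return MapRecord(chrom, snp_id, float(dist), int(pos))

# Made-up example lines: one male case genotyped at two markers.
ped = parse_ped_line("FAM1 IND1 0 0 1 2 A A G T")
snp = parse_map_line("1 rs0001 0 12345")
print(ped.individual_id, ped.genotypes)
print(snp.snp_id, snp.bp_position)
```

The n-th allele pair in a PED line corresponds to the n-th line of the MAP file, which is why the two files have to be kept together.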

Binary Files: A More Efficient Way to Store PED and MAP files
BED file: binary file, genotype information
BIM file: extended MAP file, with two extra columns holding the allele names
FAM file: first six columns of the PED file

PLINK Commands
For PED/MAP files:
> plink --file filename --options --out outfile
For BED/BIM/FAM files:
> plink --bfile filename --options --out outfile
filename    without extension; with --file, PLINK will look for filename.ped and filename.map (with --bfile, for filename.bed, filename.bim and filename.fam)
options     various kinds of options; see the following slides and the documentation
outfile     optional output name (without extension); if --out is absent, the output file will be named plink.suffix (where the suffix depends on the option chosen)
Note: you may need to type plink.exe on Windows to call the program
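For scripted workflows, the command pattern above can be assembled programmatically. The following Python sketch only builds the argument list and does not run PLINK; the helper name `plink_command` and the output name `GWAS_build36_miss` are hypothetical, and the only flags used are ones named in these slides.

```python
def plink_command(fileset, options, out=None, binary=True, exe="plink"):
    """Build a PLINK argument list following the pattern
    plink --file/--bfile filename --options --out outfile."""
    input_flag = "--bfile" if binary else "--file"
    args = [exe, input_flag, fileset] + list(options)
    if out is not None:
        # Without --out, PLINK names its output plink.suffix.
        args += ["--out", out]
    return args

# Missing-data report on a binary fileset (file names are illustrative).
cmd = plink_command("GWAS_build36", ["--missing"], out="GWAS_build36_miss")
print(" ".join(cmd))
```

The list can be handed to `subprocess.run(cmd, check=True)` once PLINK is on the PATH; on Windows, `exe="plink.exe"` may be needed.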

Rules to Remember
Always consult the log file
Consult the web documentation regularly
PLINK has no memory: each run loads the data anew, and previous filters are lost
Exact syntax and spelling are important: “minus minus” … “dash dash” … “hyphen hyphen”
Check that the analyses are doing what you expect

Data Management
--recode        creates a new PED/MAP fileset after applying any specified operations
--make-bed      creates a new binary fileset after applying any specified operations
--update-map    updates variant base-pair positions; requires a text file containing the marker name in column 1 and the new base-pair position in column 2
--update-ids    updates sample IDs; requires a text file containing the original family ID in column 1, original individual ID in column 2, new family ID in column 3 and new individual ID in column 4
--flip          given a file containing a list of SNPs with A/C/G/T alleles, swaps A↔T and C↔G
--bmerge        merges a specified binary fileset with the input data (which is considered the reference)
See for complete list of all options: https://

Input Filtering
--keep       accepts a text file with family IDs in column 1 and individual IDs in column 2, and removes all unlisted samples from the current analysis
--remove     accepts a text file with family IDs in column 1 and individual IDs in column 2, and removes all listed samples from the current analysis
--mind       filters out all individuals with missing call rates exceeding the provided value
--extract    accepts a text file with a list of variant IDs and removes all unlisted variants from the current analysis
--exclude    accepts a text file with a list of variant IDs and removes all listed variants from the current analysis
--chr        excludes all variants not on the listed chromosome(s); --from-kb and --to-kb may be added to restrict analysis to a particular region of the specified chromosome
--geno       filters out all variants with missing call rates exceeding the provided value
--maf        filters out all variants with minor allele frequency below the provided threshold
--hwe        filters out all variants which have a Hardy-Weinberg equilibrium exact test p-value below the provided threshold
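To make the variant-level thresholds concrete, here is a minimal Python sketch of the quantities that --geno and --maf compare against. It is not PLINK's implementation: the function names are hypothetical, genotypes are assumed to be coded as the count of one allele (0, 1 or 2) with None for a missing call, and the default thresholds of 0.05 and 0.01 are purely illustrative.

```python
def variant_stats(genotype_column):
    """Return (missing call rate, minor allele frequency) for one variant."""
    called = [g for g in genotype_column if g is not None]
    missing_rate = 1 - len(called) / len(genotype_column)
    if not called:
        return missing_rate, 0.0
    freq = sum(called) / (2 * len(called))   # frequency of the counted allele
    maf = min(freq, 1 - freq)                # the rarer of the two alleles
    return missing_rate, maf

def passes_filters(genotype_column, geno=0.05, maf=0.01):
    """Keep a variant unless its missing rate exceeds `geno`
    or its minor allele frequency is below `maf`."""
    missing_rate, observed_maf = variant_stats(genotype_column)
    return missing_rate <= geno and observed_maf >= maf

# Ten samples, one missing call, four copies of the counted allele.
column = [0, 1, 2, None, 1, 0, 0, 0, 0, 0]
print(variant_stats(column))
print(passes_filters(column, geno=0.05), passes_filters(column, geno=0.2))
```

The same missing-rate calculation applied per sample rather than per variant gives the quantity that --mind filters on.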

Quality Control of a GWAS dataset
Files: GWAS_build36.bed, GWAS_build36.bim, GWAS_build36.fam
897 cases and 963 controls (simulated phenotype) from Ireland and Britain, genotyped on an Illumina chip
Markers genotyped from chromosomes 1-22, X and the pseudoautosomal regions
Important to have build information
Important to have strand information

Quality Control: Sample Call Rate & Heterozygosity
Low call rate (high missingness) indicates poor DNA quality
High heterozygosity can indicate sample contamination
Low heterozygosity can occur for many reasons
Step 1: Create a QC file in Excel, with one individual per row
Step 2: Calculate missingness rates for each person (based on good quality, autosomal SNPs)
Step 3: Calculate heterozygosity values for each person (based on good quality, autosomal SNPs)
--autosome    excludes all unplaced and non-autosomal variants
--missing     produces sample-based (plink.imiss) and variant-based (plink.lmiss) missing data reports
--het         computes observed and expected autosomal homozygous genotype counts for each sample
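The two per-sample quantities in Steps 2 and 3 come down to simple arithmetic on the reported counts. The Python sketch below is an assumption-laden illustration: the function names are hypothetical, and the heterozygosity summary uses the standard inbreeding coefficient F = (O - E) / (N - E), where O and E are the observed and expected homozygous genotype counts and N is the number of non-missing genotypes.

```python
def missing_rate(n_missing, n_genotypes):
    """Proportion of a sample's genotypes that failed to call."""
    return n_missing / n_genotypes

def inbreeding_f(observed_hom, expected_hom, n_nonmissing):
    """F = (O - E) / (N - E).
    F well below 0: fewer homozygotes than expected (excess heterozygosity).
    F well above 0: more homozygotes than expected (low heterozygosity)."""
    return (observed_hom - expected_hom) / (n_nonmissing - expected_hom)

# Made-up counts for one sample with excess heterozygosity.
f_value = inbreeding_f(observed_hom=60000, expected_hom=65000, n_nonmissing=100000)
print(round(f_value, 4), missing_rate(500, 100000))
```

Plotting F against the missing rate for every sample in the QC file makes outliers on either axis easy to spot.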

Quality Control: Sample Gender Check
Compare listed gender with gender predicted from X chromosome genotypes to identify potential sample mix-ups
Step 4: Perform a check-sex for each person (based on good quality X-chromosome SNPs)
Step 5: Remove individuals with low call rates and/or failing the sex check. Heterozygosity?
--check-sex    compares sex assignments in the input dataset with those imputed from X chromosome inbreeding coefficients. By default, F estimates smaller than 0.2 yield female calls, and values larger than 0.8 yield male calls
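The default calling rule described above can be written out as a short Python sketch. The function names are hypothetical, sex is coded as in the PED file (1 = male, 2 = female, 0 = unknown), and the "OK"/"PROBLEM" labels are an assumption modelled on the status column of PLINK's sex-check report.

```python
def call_sex_from_f(f_x, female_max=0.2, male_min=0.8):
    """Impute sex from the X-chromosome inbreeding coefficient F.
    Returns 2 (female), 1 (male) or 0 (no call)."""
    if f_x < female_max:
        return 2
    if f_x > male_min:
        return 1
    return 0

def sex_check(reported_sex, f_x):
    """Compare the listed sex with the imputed sex."""
    imputed = call_sex_from_f(f_x)
    status = "OK" if imputed != 0 and imputed == reported_sex else "PROBLEM"
    return imputed, status

print(sex_check(1, 0.95))  # listed male, X looks hemizygous
print(sex_check(2, 0.95))  # listed female, X looks male: possible mix-up
print(sex_check(2, 0.50))  # F between the thresholds: no confident call
```

Samples whose F falls between the two thresholds get no call, so they are flagged for review rather than silently passed.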

Quality Control: Genetic Relationships & IBD Sharing
IBD = identical by descent; IBS = identical by state
[Pedigree figure with genotype labels C/C, T/T and C/T]
All of the children share 2 alleles IBS
Child 1 & 2 share 2 alleles IBD
Child 2 & 3 share 1 allele IBD
Child 2 & 4 share 0 alleles IBD

Quality Control: Population Stratification
Imagine a sample of individuals drawn from a population consisting of two distinct subgroups which differ in allele frequency.
If the prevalence of disease is greater in one sub-population, then this group will be over-represented amongst the cases.
Any marker which is also of higher frequency in that subgroup will appear to be associated with the disease.
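A small worked example shows how this confounding arises. In the Python sketch below all numbers are invented for illustration: two equally sized subgroups differ in both disease prevalence and allele frequency, and the allele has no effect on disease within either subgroup.

```python
# Hypothetical subgroups: within each one, cases and controls carry
# the allele at exactly the same frequency (no true association).
subgroups = [
    {"n": 1000, "prevalence": 0.20, "allele_freq": 0.60},
    {"n": 1000, "prevalence": 0.05, "allele_freq": 0.20},
]

def pooled_allele_freq(groups, status):
    """Allele frequency among all cases (or all controls) once the
    subgroups are pooled into one sample."""
    total_people = 0.0
    total_allele = 0.0
    for g in groups:
        fraction = g["prevalence"] if status == "case" else 1 - g["prevalence"]
        people = g["n"] * fraction
        total_people += people
        total_allele += people * g["allele_freq"]
    return total_allele / total_people

case_freq = pooled_allele_freq(subgroups, "case")
control_freq = pooled_allele_freq(subgroups, "control")
# Cases: 200 + 50 people, controls: 800 + 950 people.
print(round(case_freq, 3), round(control_freq, 3))
```

The pooled allele frequency is about 0.52 in cases against about 0.38 in controls, so the marker looks associated with disease even though it is unrelated to disease inside either subgroup.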

Quality Control: Sample Ethnicity
Compare genetic similarity/dissimilarity of GWAS individuals to others of different ethnicity
Use Principal Components Analysis (PCA)
[Figure: PCA plot with clusters labelled YRI, JPT/CHB and CEU, plus outliers]

Quality Control: Adjust for Population Structure

Linkage Disequilibrium (LD)
Linkage disequilibrium: the non-random association of alleles at linked loci
A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes
Consider a G/A SNP and a nearby C/T SNP
Theoretically, there are 4 possible haplotypes: G-T, G-C, A-T, A-C
If, however, only the G-T and A-C haplotypes are observed in the population, then the 2 SNPs are in perfect linkage disequilibrium: they are perfectly correlated
If we genotype the first SNP, we know what the alleles are at the second SNP
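The G/A and C/T example can be checked numerically with the usual r² measure of LD, r² = D² / (pA(1 - pA) pB(1 - pB)) with D = pAB - pA·pB. The Python sketch below uses a hypothetical function name and invented haplotype counts.

```python
def ld_r_squared(hap_counts, allele_a, allele_b):
    """r^2 between two SNPs from haplotype counts keyed by
    (allele at SNP 1, allele at SNP 2)."""
    total = sum(hap_counts.values())
    p_ab = hap_counts.get((allele_a, allele_b), 0) / total
    p_a = sum(c for (a, _), c in hap_counts.items() if a == allele_a) / total
    p_b = sum(c for (_, b), c in hap_counts.items() if b == allele_b) / total
    d = p_ab - p_a * p_b                       # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic")
    return d * d / denom

# Only G-T and A-C haplotypes are observed: perfect LD.
perfect = {("G", "T"): 60, ("A", "C"): 40}
# All four haplotypes equally common: no LD.
independent = {("G", "T"): 25, ("G", "C"): 25, ("A", "T"): 25, ("A", "C"): 25}

print(round(ld_r_squared(perfect, "G", "T"), 6))
print(round(ld_r_squared(independent, "G", "T"), 6))
```

The first dataset gives r² of 1 (knowing one SNP determines the other), and the second gives 0.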

Quality Control: Creating an LD-pruned Dataset
Checking for relatedness is a relatively long process, but can be sped up by using a reduced dataset (i.e. fewer SNPs)
PCA is LD-sensitive; the dataset must be LD-pruned first
It makes sense to use the same (reduced) set of SNPs for both processes
Step 6: Identify a list of good quality SNPs, excluding SNPs within known regions of extensive LD, and perform further LD-pruning
Step 7: Create a new binary fileset, including only the SNPs identified in Step 6
--exclude range     --exclude normally removes all listed variants from the current analysis. With the 'range' modifier, all variants within chromosomal regions specified in a text file are excluded
--indep-pairwise    requires three parameters: a window size in variant count or kilobase (if the 'kb' modifier is present) units, a variant count to shift the window at the end of each step, and a pairwise r^2 threshold. At each step, pairs of variants in the current window with r^2 greater than the threshold are noted, and variants are greedily pruned from the window until no such pairs remain

Relationship Testing: Pairwise IBD Calculations
Step 8: Calculate pairwise IBD for all individuals remaining, but restrict reporting to those with PI-HAT values greater than 0.1
Step 9: Remove one of each related pair from the dataset prior to population structure analysis
--genome    invokes an IBS/IBD computation, and then writes a report to plink.genome. The report includes the proportion of the genome shared IBD (PI-HAT) between pairs of individuals
--min       plink.genome files can be VERY large. --min can be used to restrict reporting to those pairs of individuals where PI-HAT exceeds a specified threshold (e.g. restrict to those related at 1st cousin level or closer)
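Steps 8 and 9 can be sketched in Python as follows. PI-HAT is taken here as P(IBD=2) + 0.5 × P(IBD=1); the function names, the sample IDs and the rule of always dropping the second member of a pair are assumptions made for illustration (in practice one might prefer to drop the sample with the lower call rate).

```python
def pi_hat(z1, z2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1

def samples_to_remove(pairs, threshold=0.1):
    """Greedily pick one member of each related pair to drop.
    `pairs` holds (id1, id2, P(IBD=1), P(IBD=2)) tuples."""
    removed = set()
    for id1, id2, z1, z2 in pairs:
        related = pi_hat(z1, z2) > threshold
        if related and id1 not in removed and id2 not in removed:
            removed.add(id2)
    return removed

pairs = [
    ("S1", "S2", 0.50, 0.25),  # full siblings: PI-HAT = 0.5
    ("S3", "S4", 0.25, 0.00),  # first cousins: PI-HAT = 0.125
    ("S5", "S6", 0.02, 0.00),  # effectively unrelated: PI-HAT = 0.01
]
print(sorted(samples_to_remove(pairs)))
```

First cousins have an expected PI-HAT of 0.125, which is why a reporting threshold of 0.1 captures relatives at that level or closer.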

Population Structure: Identifying Non-Europeans
Step 10: Merge the unrelated pruned dataset with hapmap3 data (known ethnicities, forward strand, build36)
Step 11: Flip strand in the GWAS data for SNPs flagged by the bmerge process and make a new binary dataset
Step 12: Repeat the merge with hapmap3, this time using the flipped dataset. Add a geno filter to remove any SNPs not genotyped in both datasets
Step 13: Perform PCA on the merged dataset and extract the top 2 principal components. Include a header in a tab-delimited report
Step 14: Remove non-European individuals
--bmerge    merges a specified binary fileset with the input data (which is considered the reference)
--flip      given a file containing a list of SNPs with A/C/G/T alleles, swaps A↔T and C↔G
--pca       extracts the top specified number of principal components of the variance-standardized relationship matrix. Eigenvectors are written to plink.eigenvec, and top eigenvalues are written to plink.eigenval
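The strand flip in Step 11 is a simple complement operation on allele codes. The Python sketch below uses hypothetical function names; it also flags A/T and C/G SNPs, where flipping just swaps the two alleles, so a strand mismatch cannot be detected from the allele codes alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def flip_alleles(allele1, allele2):
    """Swap A<->T and C<->G; any other code (e.g. a missing-allele code)
    is left unchanged."""
    return COMPLEMENT.get(allele1, allele1), COMPLEMENT.get(allele2, allele2)

def is_strand_ambiguous(allele1, allele2):
    """True for A/T and C/G SNPs, whose allele pair looks the same
    on both strands."""
    return COMPLEMENT.get(allele1) == allele2

print(flip_alleles("A", "G"))          # the pair as read on the opposite strand
print(is_strand_ambiguous("A", "T"))   # flipping cannot be detected
print(is_strand_ambiguous("A", "G"))   # mismatch with a reference is visible
```

This is why having strand information for both datasets matters before merging: for unambiguous SNPs a failed merge reveals the mismatch, but for A/T and C/G SNPs it does not.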

Population Structure: Generate PCs to use as Covariates
Step 15: Perform PCA on the European dataset and extract the top 10 principal components. Include a header in a tab-delimited report
Step 16: Remove individuals failing QC from the original GWAS dataset

Analysis: Apply Filters and Analyse
Step 17: Perform logistic regression analysis using PC(s) and any other appropriate covariates, applying appropriate SNP filters
--geno          filters out all variants with missing call rates exceeding the provided value
--maf           filters out all variants with minor allele frequency below the provided threshold
--hwe           filters out all variants which have a Hardy-Weinberg equilibrium exact test p-value below the provided threshold
--logistic      performs logistic regression given a case/control phenotype and some covariates
--covar         designates the file to load covariates from. The file format is: optional header line, FID and IID in the first two columns, covariates in the remaining columns. By default, the main phenotype is set to missing if any covariate is missing
--covar-name    lets you specify a subset of covariates to load, by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges

Analysis: Example QQ Plots
Quantile-quantile (QQ) plots are informative
[Figure: example QQ plots, annotated "No signal"; "Enrichment of low p-values"; "May be true association"; "Population stratification? Polygenic signal?"]

Analysis: Generate Plots
Step 18: Repeat the analysis, adding a flag to generate the P-values expected under the null hypothesis
--adjust qq-plot    --adjust causes an .adjusted file to be generated with each association test report, containing several basic multiple testing corrections for the raw p-values. 'qq-plot' adds a quantile column to simplify QQ plotting
Step 19: Create a QQ plot in R
    # Read the .adjusted report written alongside the logistic regression results
    data <- read.table(file="GWAS_build36_postQC_analysis_adj.assoc.logistic.adjusted", header=TRUE)
    # Expected (QQ column) against observed (UNADJ column) -log10 p-values
    plot(-log(data$QQ, 10), -log(data$UNADJ, 10), xlab="Expected -logP values", ylab="Observed -logP values")
    # Reference line y = x: points on it match the null expectation
    abline(a=0, b=1)
Step 20: Create a Manhattan plot in Haploview
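For readers who want to see what the QQ plot is comparing, the expected values can be computed by hand. This Python sketch is an assumption rather than a description of PLINK's internals: the function name is hypothetical, and the expected quantile for the p-value of rank i out of n is taken as (i - 0.5) / n, a common plotting convention.

```python
import math

def qq_points(p_values):
    """Return (expected, observed) -log10 p-value pairs, ordered from the
    smallest p-value to the largest."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected = (rank - 0.5) / n   # assumed plotting-position convention
        points.append((-math.log10(expected), -math.log10(p)))
    return points

# Made-up p-values: one strong signal among otherwise unremarkable tests.
example = [0.00001, 0.20, 0.45, 0.70, 0.95]
for expected, observed in qq_points(example):
    print(round(expected, 3), round(observed, 3))
```

Under the null hypothesis the points fall close to the line y = x; a point far above the line at the right-hand end, like the first p-value here, is the pattern expected from a true association.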

Analysis: Generate Manhattan Plot in Haploview