
1 Genome-Wide Association Studies (GWAS). Study design: Case/Control, Family-based, Cohort. Phenotype: Dichotomous, Quantitative. 10^3 – 10^5 individuals; 10^5 – 10^6 polymorphisms.

2 Genotyping [Figure: genotype cluster plots, Cartesian coordinate view and polar coordinate view]

3 Genotyping [Figure: poor cluster plot]

4 PLINK
What is PLINK?
– Software for analysing phenotype/genotype data
– Run from the command line
Why should we use PLINK?
– Probably the most common tool used to analyse (human) GWAS data
– Free and open source
– Designed to perform a wide range of basic, large-scale analyses in a computationally efficient manner
– Can be used on several platforms
– No programming ability required, and the documentation is excellent

5 The Original PLINK: http://pngu.mgh.harvard.edu/~purcell/plink/

6 PLINK 1.9: Speedier (but less informative): https://www.cog-genomics.org/plink2/

7 How to get PLINK (for Windows)
Determine if your PC is 32-bit or 64-bit

8 How to get PLINK (for Windows)
Determine if your PC is 32-bit or 64-bit
Download the relevant stable build from the PLINK 1.9 website to a convenient location, then unzip the PLINK program/executable

9 Running PLINK (on Windows)
PLINK is run from the command prompt
Navigate to the location of the data or PLINK executable using the cd command (cd = change directory)

10 Note
The command prompt needs to be told where the PLINK executable and the data are
It is easiest to direct the command prompt to the same folder/directory as your data
If the PLINK executable is here, all good
If not, put the path of PLINK’s location in your environment path to make it easier to call:
> echo %PATH%
> path = C:\PLINK_location;%PATH%

11 Note
Changing the path in this way is temporary and will only work for the current command prompt window

12 File Formats
filename.ped    text-based pedigree file
filename.map    text-based map file
filename.bed    binary genotype file
filename.bim    marker file
filename.fam    family/individual file
PHENO file      phenotype file
COVAR file      covariate file

13 PED Files: the individual & genotype file

14 MAP file: the marker information file [Figure: example MAP file, with the genetic distance column labelled]

15 The PED and MAP file
The Ped(igree) Files
1. Family ID
2. Individual ID
3. Paternal ID
4. Maternal ID
5. Sex (1=male; 2=female; other=unknown)
6. Phenotype (pheno)
The Map Files
1. Chromosome (1-22, X|23, Y|24, PAR|25, MT|26, or 0=unplaced)
2. rs# or SNP identifier
3. Genetic distance (morgans)
4. Base-pair position (bp units)
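As a rough illustration of the two layouts listed above, the following Python sketch splits one PED line and one MAP line into named fields. The example lines and the names `parse_ped_line` and `parse_map_line` are made up for illustration. The sketch also assumes the usual PED layout in which the columns after the sixth hold two alleles per marker.

```python
from typing import NamedTuple


class PedRecord(NamedTuple):
    family_id: str
    individual_id: str
    paternal_id: str
    maternal_id: str
    sex: str          # "1" = male, "2" = female, anything else = unknown
    phenotype: str
    genotypes: list   # one (allele1, allele2) tuple per marker


class MapRecord(NamedTuple):
    chromosome: str
    snp_id: str
    genetic_distance: float  # morgans
    bp_position: int         # base pairs


def parse_ped_line(line):
    """Split a whitespace-delimited PED line into its six ID columns plus genotypes."""
    fields = line.split()
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two alleles per marker")
    genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
    return PedRecord(*fields[:6], genotypes)


def parse_map_line(line):
    """Split a four-column MAP line."""
    chrom, snp_id, dist, pos = line.split()
    return MapRecord(chrom, snp_id, float(dist), int(pos))


# Hypothetical example lines (not taken from the workshop dataset)
ped = parse_ped_line("FAM1 IND1 0 0 1 2 A A G T")
snp = parse_map_line("1 rs0000001 0 12345")
print(ped.individual_id, ped.genotypes)
print(snp.snp_id, snp.bp_position)
```

Each marker in the MAP file corresponds, in order, to one allele pair in every PED line, which is why the two files are always used together.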

16 Binary Files: A More Efficient Way to Store PED and MAP files
BED file: binary file containing the genotype information
BIM file: extended MAP file with two extra columns (the allele names)
FAM file: the first six columns of the PED file
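To give a feel for why the binary format is compact, here is a small Python sketch that unpacks genotype bytes the way PLINK 1 BED files store them: a three-byte header, then two bits per genotype and four genotypes per byte, read from the low-order bits upward. The function name `decode_variant_block` and the example byte are illustrative assumptions, and the sketch only handles the default variant-major layout.

```python
# Three-byte header of a variant-major PLINK 1 BED file
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# Two-bit genotype codes (A1 and A2 are the two alleles listed in the BIM file)
CODES = {
    0b00: "hom_A1",
    0b01: "missing",
    0b10: "het",
    0b11: "hom_A2",
}


def decode_variant_block(block, n_samples):
    """Decode the bytes holding one variant's genotypes for n_samples individuals."""
    calls = []
    for byte in block:
        for shift in (0, 2, 4, 6):  # lowest-order bit pair first
            if len(calls) == n_samples:
                break  # remaining bits in the last byte are padding
            calls.append(CODES[(byte >> shift) & 0b11])
    return calls


# Hypothetical byte: bit pairs, from low to high, are 00, 01, 10, 11
print(decode_variant_block(bytes([0b11100100]), n_samples=4))
```

With four genotypes per byte, one variant for 2,000 individuals takes 500 bytes, compared with several thousand characters in a PED file.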

17 PLINK Commands
For PED/MAP files:
> plink --file filename --options --out outfile
For BED/BIM/FAM files:
> plink --bfile filename --options --out outfile
filename: given without extension; with --file, PLINK will look for filename.ped and filename.map (with --bfile, for filename.bed, filename.bim and filename.fam)
options: various kinds of options; see the following slides and the documentation
outfile: optional output name (without extension); if --out is absent, the output file will be named plink.suffix (where suffix depends on the option chosen)
Note: you may need to type plink.exe on Windows to call the program
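The command pattern above can also be assembled from a script, which helps when the same fileset is passed through many steps. The Python sketch below only builds and prints the argument list. The helper name `plink_command` and the output name `GWAS_build36_missing` are assumptions for illustration, and actually running the command requires PLINK to be installed and on the path.

```python
import shlex


def plink_command(fileset, options, out=None, binary=True, exe="plink"):
    """Build a PLINK argument list.

    fileset: file name without extension
    options: list of option strings, e.g. ["--missing"]
    binary:  True uses --bfile (BED/BIM/FAM), False uses --file (PED/MAP)
    """
    cmd = [exe, "--bfile" if binary else "--file", fileset]
    cmd += list(options)
    if out is not None:
        cmd += ["--out", out]
    return cmd


cmd = plink_command("GWAS_build36", ["--missing"], out="GWAS_build36_missing")
print(shlex.join(cmd))

# To run it for real (assumes PLINK is installed and the files exist):
# import subprocess
# subprocess.run(cmd, check=True)
```

Because PLINK keeps no state between runs, each call must list every filter it needs, and a helper like this makes that explicit.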

18 Rules to Remember
Always consult the log file
Consult the web documentation regularly
PLINK has no memory: each run loads the data anew, and previous filters are lost
Exact syntax and spelling are important: options start with two hyphens (“minus minus” … “dash dash” … “hyphen hyphen”)
Check the analyses are doing what you expect

19 Data Management
--recode: creates a new PED/MAP fileset after applying any specified operations
--make-bed: creates a new binary fileset after applying any specified operations
--update-map: updates variant base-pair positions; requires a text file containing the marker name in column 1 and the new base-pair position in column 2
--update-ids: updates sample IDs; requires a text file containing the original family ID in column 1, the original individual ID in column 2, the new family ID in column 3 and the new individual ID in column 4
--flip: given a file containing a list of SNPs with A/C/G/T alleles, swaps A↔T and C↔G
--bmerge: merges a specified binary fileset with the input data (which is considered the reference)
See https://www.cog-genomics.org/plink2/index for a complete list of all options

21 Input Filtering
--keep: accepts a text file with family IDs in column 1 and individual IDs in column 2, and removes all unlisted samples from the current analysis
--remove: accepts a text file with family IDs in column 1 and individual IDs in column 2, and removes all listed samples from the current analysis
--mind: filters out all individuals with missing call rates exceeding the provided value
--extract: accepts a text file with a list of variant IDs and removes all unlisted variants from the current analysis
--exclude: accepts a text file with a list of variant IDs and removes all listed variants from the current analysis
--chr: excludes all variants not on the listed chromosome(s); --from-kb and --to-kb may be added to restrict analysis to a particular region of the specified chromosome
--geno: filters out all variants with missing call rates exceeding the provided value
--maf: filters out all variants with minor allele frequency below the provided threshold
--hwe: filters out all variants which have a Hardy-Weinberg equilibrium exact test p-value below the provided threshold
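To make the variant-level thresholds concrete, the following Python sketch computes the missing call rate and minor allele frequency for a single biallelic variant and applies --geno and --maf style cut-offs. The function names, the toy genotypes and the threshold values are illustrative assumptions, not PLINK defaults or PLINK's own code.

```python
def variant_stats(genotypes):
    """Return (missing call rate, minor allele frequency) for one variant.

    genotypes: list of (allele1, allele2) tuples, with None for a missing call.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    counts = {}
    for a1, a2 in called:
        counts[a1] = counts.get(a1, 0) + 1
        counts[a2] = counts.get(a2, 0) + 1
    total = sum(counts.values())
    # A monomorphic (or entirely missing) variant has a minor allele frequency of 0
    maf = min(counts.values()) / total if len(counts) > 1 else 0.0
    return missing_rate, maf


def passes_filters(genotypes, geno=0.02, maf=0.01):
    """Keep a variant unless its missingness exceeds geno or its MAF is below maf."""
    missing_rate, freq = variant_stats(genotypes)
    return missing_rate <= geno and freq >= maf


# Hypothetical variant typed in five individuals, one call missing
calls = [("A", "A"), ("A", "G"), ("G", "G"), None, ("A", "A")]
print(variant_stats(calls))
print(passes_filters(calls))             # fails: missingness is above 0.02
print(passes_filters(calls, geno=0.25))  # passes with a looser threshold
```

The same logic applied per individual rather than per variant gives the sample call rate that --mind filters on.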

22 Quality Control of a GWAS dataset
Files: GWAS_build36.bed, GWAS_build36.bim, GWAS_build36.fam
897 cases and 963 controls (simulated phenotype) from Ireland and Britain, genotyped on an Illumina chip
420,755 markers genotyped, from chromosomes 1-22, X and the pseudoautosomal regions
Important to have build information
Important to have strand information

23 Quality Control: Sample Call Rate & Heterozygosity
Low call rate (high missingness) indicates poor DNA quality
High heterozygosity can indicate sample contamination
Low heterozygosity can occur for many reasons
--autosome: excludes all unplaced and non-autosomal variants
--missing: produces sample-based (plink.imiss) and variant-based (plink.lmiss) missing data reports
--het: computes observed and expected autosomal homozygous genotype counts for each sample
Step 1: Create a QC file in Excel, with one individual per row
Step 2: Calculate missingness rates for each person (based on good quality, autosomal SNPs)
Step 3: Calculate heterozygosity values for each person (based on good quality, autosomal SNPs)
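Since --het reports homozygous counts rather than a heterozygosity rate, one extra step is needed before samples can be compared. The Python sketch below derives the rate as (non-missing genotypes minus observed homozygotes) divided by non-missing genotypes, and flags samples far from the mean. The three standard deviation cut-off is a common convention assumed here rather than something stated in the slides, and the sample IDs and counts are invented.

```python
from statistics import mean, stdev


def het_rate(o_hom, n_nm):
    """Heterozygosity rate from the O(HOM) and N(NM) columns of plink.het."""
    return (n_nm - o_hom) / n_nm


def flag_het_outliers(samples, n_sd=3.0):
    """Return samples whose heterozygosity rate is more than n_sd SDs from the mean.

    samples: dict mapping sample ID -> (O(HOM), N(NM))
    """
    rates = {iid: het_rate(o, n) for iid, (o, n) in samples.items()}
    values = list(rates.values())
    mu, sd = mean(values), stdev(values)
    return {iid: r for iid, r in rates.items() if abs(r - mu) > n_sd * sd}


# Hypothetical counts: twenty typical samples and one with excess heterozygosity
samples = {f"S{i}": (7000, 10000) for i in range(20)}
samples["S_contam"] = (6000, 10000)

print(flag_het_outliers(samples))
```

Flagged samples are candidates for review alongside their call rates, since contamination and poor DNA quality often show up together.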

24 Quality Control: Sample Gender Check
Compare listed gender with gender predicted based on X chromosome genotypes to identify potential sample mix-ups
--check-sex: compares sex assignments in the input dataset with those imputed from X chromosome inbreeding coefficients. By default, F estimates smaller than 0.2 yield female calls, and values larger than 0.8 yield male calls
Step 4: Perform a check-sex for each person (based on good quality X-chromosome SNPs)
Step 5: Remove individuals with low call rates and/or failing the sex check. Heterozygosity?
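The default thresholds quoted above amount to a simple decision rule, sketched here in Python. The function names are made up, the sex codes follow the PED convention (1 = male, 2 = female, 0 = unknown), and the OK/PROBLEM labels are meant to mirror the style of PLINK's report rather than reproduce its implementation.

```python
def call_sex_from_f(f, female_max=0.2, male_min=0.8):
    """Impute sex from the X chromosome inbreeding coefficient F."""
    if f < female_max:
        return 2  # female: heterozygous X genotypes are present
    if f > male_min:
        return 1  # male: X genotypes look fully homozygous
    return 0      # ambiguous: between the two thresholds


def sex_check(pedsex, f):
    """Compare the recorded sex with the imputed sex."""
    snpsex = call_sex_from_f(f)
    status = "OK" if pedsex != 0 and pedsex == snpsex else "PROBLEM"
    return snpsex, status


print(sex_check(1, 0.95))  # recorded male, imputed male
print(sex_check(2, 0.95))  # recorded female, imputed male: possible sample mix-up
print(sex_check(2, 0.50))  # ambiguous F estimate
```

Samples with intermediate F values deserve a closer look before removal, since they can also reflect poor genotyping quality.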

25 Quality Control: Genetic Relationships & IBD Sharing
IBD = identical by descent; IBS = identical by state
[Figure: pedigree with genotype labels C/C and T/T, and four children (1 to 4), each with genotype C/T]
All of the children share 2 alleles IBS
Child 1 & 2 share 2 alleles IBD
Child 2 & 3 share 1 allele IBD
Child 2 & 4 share 0 alleles IBD
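IBS sharing, unlike IBD sharing, can be read directly off two genotypes. A minimal Python sketch, with a made-up function name, counts the alleles two individuals share by state at one SNP:

```python
from collections import Counter


def ibs_count(g1, g2):
    """Number of alleles (0, 1 or 2) that two genotypes share identical by state."""
    shared = Counter(g1) & Counter(g2)  # multiset intersection of the allele pairs
    return sum(shared.values())


print(ibs_count(("C", "T"), ("C", "T")))  # two heterozygous children
print(ibs_count(("C", "C"), ("C", "T")))
print(ibs_count(("C", "C"), ("T", "T")))
```

Two children who are both C/T always share 2 alleles IBS, yet they may share 0, 1 or 2 alleles IBD depending on which parental chromosomes they inherited. This is why IBD has to be estimated statistically from many markers rather than counted at a single SNP.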

26 Quality Control: Population Stratification
Imagine a sample of individuals drawn from a population consisting of two distinct subgroups which differ in allele frequency.
If the prevalence of disease is greater in one sub-population, then this group will be over-represented amongst the cases.
Any marker which is also of higher frequency in that subgroup will appear to be associated with the disease.
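The argument above can be checked with a small worked example. In the Python sketch below the marker has no effect on disease within either subgroup, yet the pooled allele frequency differs between cases and controls. All numbers and the function name are invented for illustration.

```python
def pooled_allele_freqs(subgroups):
    """Allele frequency among all cases and among all controls.

    Each subgroup is a dict with its size ("n"), disease prevalence and allele
    frequency ("freq"). Within a subgroup, cases and controls have the same
    allele frequency, i.e. the marker is not associated with disease.
    """
    case_alleles = case_total = ctrl_alleles = ctrl_total = 0.0
    for g in subgroups:
        cases = g["n"] * g["prevalence"]
        controls = g["n"] - cases
        case_alleles += cases * g["freq"]
        case_total += cases
        ctrl_alleles += controls * g["freq"]
        ctrl_total += controls
    return case_alleles / case_total, ctrl_alleles / ctrl_total


# Hypothetical subgroups: the first has higher prevalence and higher allele frequency
groups = [
    {"n": 1000, "prevalence": 0.20, "freq": 0.50},
    {"n": 1000, "prevalence": 0.05, "freq": 0.10},
]

case_freq, ctrl_freq = pooled_allele_freqs(groups)
print(round(case_freq, 3), round(ctrl_freq, 3))
```

Here 200 of the 250 cases come from the first subgroup, so the allele looks enriched in cases (about 0.42 against about 0.28 in controls) purely because of ancestry.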

27 Quality Control: Sample Ethnicity
Compare genetic similarity/dissimilarity of GWAS individuals to others of different ethnicity
Use Principal Components Analysis (PCA)
[Figure: PCA plot with clusters labelled YRI, JPT/CHB and CEU, and outliers marked]

28 Quality Control: Adjust for Population Structure

29 Linkage Disequilibrium (LD)
Linkage disequilibrium: the non-random association of alleles at linked loci
A measure of the tendency of some alleles to be inherited together on haplotypes descended from ancestral chromosomes
Consider a G/A SNP and a nearby C/T SNP
Theoretically, there are 4 possible haplotypes: G-T, G-C, A-T, A-C
If, however, only the G-T and A-C haplotypes are observed in the population, then the 2 SNPs are in perfect linkage disequilibrium: they are perfectly correlated
If we genotype the first SNP, we know what the alleles are at the second SNP
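The G/A and C/T example above can be put into numbers using the standard LD coefficient D and the squared correlation r^2 between the two SNPs. The Python sketch below uses invented haplotype frequencies and a made-up function name.

```python
def ld_stats(p_gt, p_gc, p_at, p_ac):
    """Return (D, r^2) from the four haplotype frequencies of a G/A SNP and a C/T SNP."""
    p_g = p_gt + p_gc            # frequency of allele G at the first SNP
    p_t = p_gt + p_at            # frequency of allele T at the second SNP
    d = p_gt - p_g * p_t         # departure from independence
    r2 = d * d / (p_g * (1 - p_g) * p_t * (1 - p_t))
    return d, r2


# Only G-T and A-C haplotypes are observed: perfect LD
print(ld_stats(0.6, 0.0, 0.0, 0.4))

# All four haplotypes occur at the product of their allele frequencies: no LD
print(ld_stats(0.36, 0.24, 0.24, 0.16))
```

In the first case r^2 is 1, so genotyping one SNP fully predicts the other. In the second case r^2 is 0 (up to rounding), so the SNPs carry independent information.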

30 Quality Control: Creating an LD-pruned Dataset
Checking for relatedness is a relatively long process, but it can be sped up by using a reduced dataset (i.e. fewer SNPs)
PCA is LD-sensitive; the dataset must be LD-pruned first
It makes sense to use the same (reduced) set of SNPs for both processes
Step 6: Identify a list of good quality SNPs, excluding SNPs within known regions of extensive LD, and perform further LD-pruning
Step 7: Create a new binary fileset, including only the SNPs identified in Step 6
--exclude range: --exclude normally removes all listed variants from the current analysis; with the 'range' modifier, all variants within the chromosomal regions specified in a text file are excluded
--indep-pairwise: requires three parameters: a window size in variant count or kilobase (if the 'kb' modifier is present) units, a variant count to shift the window at the end of each step, and a pairwise r^2 threshold. At each step, pairs of variants in the current window with r^2 greater than the threshold are noted, and variants are greedily pruned from the window until no such pairs remain.

31 Relationship Testing: Pairwise IBD Calculations
Step 8: Calculate pairwise IBD for all individuals remaining, but restrict reporting to those with PI-HAT values greater than 0.1
Step 9: Remove one of each related pair from the dataset prior to population structure analysis
--genome: invokes an IBS/IBD computation, and then writes a report to plink.genome. The report includes the proportion of the genome shared IBD (PI-HAT) between pairs of individuals
--min: plink.genome files can be VERY large. --min can be used to restrict reporting to those pairs of individuals where PI-HAT exceeds a specified threshold (e.g. restrict to those related at 1st-cousin level or closer)
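PI-HAT summarises the three IBD sharing probabilities as P(IBD=2) + 0.5 × P(IBD=1). The Python sketch below applies that formula to the textbook expectations for a few relationships, to show why a 0.1 threshold keeps first cousins and closer. The relationship table, the pair names and the function names are illustrative assumptions; real estimates are noisy and will not sit exactly on these values.

```python
def pi_hat(z0, z1, z2):
    """Proportion of the genome shared IBD, from P(IBD=0), P(IBD=1), P(IBD=2)."""
    return z2 + 0.5 * z1


# Theoretical IBD probabilities (Z0, Z1, Z2) for some relationships
EXPECTED = {
    "identical twins / duplicates": (0.0, 0.0, 1.0),
    "parent-child": (0.0, 1.0, 0.0),
    "full siblings": (0.25, 0.5, 0.25),
    "first cousins": (0.75, 0.25, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def related_pairs(pairs, threshold=0.1):
    """Keep the pairs whose PI-HAT exceeds the threshold.

    pairs: dict mapping (id1, id2) -> (Z0, Z1, Z2)
    """
    return {ids: pi_hat(*z) for ids, z in pairs.items() if pi_hat(*z) > threshold}


for name, z in EXPECTED.items():
    print(name, pi_hat(*z))

# Hypothetical pairs
pairs = {("A", "B"): EXPECTED["first cousins"], ("A", "C"): EXPECTED["unrelated"]}
print(related_pairs(pairs))
```

First cousins have an expected PI-HAT of 0.125, which is just above the 0.1 reporting threshold used in Step 8.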

32 Population Structure: Identifying Non-Europeans
Step 10: Merge the unrelated, pruned dataset with hapmap3 data (known ethnicities, forward strand, build36)
Step 11: Flip strand in the GWAS data for SNPs flagged by the bmerge process and make a new binary dataset
Step 12: Repeat the merge with hapmap3, this time using the flipped dataset. Add a geno filter to remove any SNPs not genotyped in both datasets
Step 13: Perform PCA on the merged dataset and extract the top 2 principal components. Include a header in a tab-delimited report
Step 14: Remove non-European individuals
--bmerge: merges a specified binary fileset with the input data (which is considered the reference)
--flip: given a file containing a list of SNPs with A/C/G/T alleles, swaps A↔T and C↔G
--pca: extracts the top specified number of principal components of the variance-standardized relationship matrix. Eigenvectors are written to plink.eigenvec, and top eigenvalues are written to plink.eigenval.
http://www.well.ox.ac.uk/~wrayner/strand/

33 Population Structure: Generate PCs to use as Covariates
Step 15: Perform PCA on the European dataset and extract the top 10 principal components. Include a header in a tab-delimited report
Step 16: Remove individuals failing QC from the original GWAS dataset
https://github.com/DReichLab/EIG

34 Analysis: Apply Filters and Analyse
Step 17: Perform logistic regression analysis using PC(s) and any other appropriate covariates, applying appropriate SNP filters
--geno: filters out all variants with missing call rates exceeding the provided value
--maf: filters out all variants with minor allele frequency below the provided threshold
--hwe: filters out all variants which have a Hardy-Weinberg equilibrium exact test p-value below the provided threshold
--logistic: performs logistic regression given a case/control phenotype and some covariates
--covar: designates the file to load covariates from. The file format is an optional header line, FID and IID in the first two columns, and covariates in the remaining columns. By default, the main phenotype is set to missing if any covariate is missing
--covar-name: lets you specify a subset of covariates to load, by column name; separate multiple column names with spaces or commas, and use dashes to designate ranges

35 Analysis: Example QQ Plots
Quantile-Quantile (QQ) plots are informative
[Figure: example QQ plots, with panels labelled "No signal", "Enrichment of low p-values: may be true association", and "Population stratification? Polygenic signal?"]

36 Analysis: Generate Plots
Step 18: Repeat the analysis, adding a flag to generate the P-value quantiles expected under the null hypothesis
--adjust qq-plot: --adjust causes an .adjusted file to be generated with each association test report, containing several basic multiple-testing corrections for the raw p-values. 'qq-plot' adds a quantile column to simplify QQ plotting.
Step 19: Create a QQ plot in R
data <- read.table(file="GWAS_build36_postQC_analysis_adj.assoc.logistic.adjusted", header=TRUE)
plot(-log(data$QQ, 10), -log(data$UNADJ, 10), xlab="Expected -logP values", ylab="Observed -logP values")
abline(a=0, b=1)
Step 20: Create a Manhattan plot in Haploview
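For readers who want to see what the expected quantiles in a QQ plot represent, the Python sketch below builds the plotted points from a list of raw p-values. It assumes the common (rank - 0.5) / n convention for the expected quantile of the p-value at each rank, so the exact values in PLINK's QQ column may differ. The function name and the example p-values are invented.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p-value first."""
    observed = sorted(p_values)
    n = len(observed)
    points = []
    for rank, p in enumerate(observed, start=1):
        expected = (rank - 0.5) / n  # quantile expected under the null hypothesis
        points.append((-math.log10(expected), -math.log10(p)))
    return points


# Hypothetical p-values from four tests
for exp_logp, obs_logp in qq_points([0.5, 0.01, 0.2, 0.9]):
    print(round(exp_logp, 3), round(obs_logp, 3))
```

Points lying on the diagonal are consistent with no association. Points rising above it at the right-hand end indicate more small p-values than expected by chance.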

37 Analysis: Generate Manhattan Plot in Haploview

