1 BST 775 Lecture
PLINK – A Popular Toolset for GWAS
Guodong Wu
SSG, Department of Biostatistics
University of Alabama at Birmingham
September 24, 2013
2 Overview
Designed for GWAS and population-based linkage analysis.
Developed by Shaun Purcell*; current version v1.07.
Why is the toolset so popular?
Stores GWAS data sets, which are often too large for SAS, R, or other statistical packages.
Well-developed guidelines and tools for data set management and quality control.
Platform for various association methods.
* Purcell et al. 2007, AJHG
3 Overview
Data management
Summary statistics
Quality control
Association test
4 PLINK in GWAS workflow
Experimental design & sample collection
Cell intensity files for each chip
GeneChip scanner
Summary statistics and quality control
Phenotype, sex and other covariates
Assessment of population stratification
Whole-genome SNP-based association
Further exploration of ‘hits’
Visualization and follow-up
5 Data Format
PED and MAP format (people in rows, SNPs in columns)
      SNPs →
P1    A A   A C   C G   T T   A A   T T
P2    A C   A A   C G   G T   A C   T T
P3    C C   A C   G G   T T   A A   T T
P4    C C   A A   G G   G T   A A   T T
← People
SNP information:
1 snp
X snp
Y snp
XY snp
MT snp
Transposed format (SNPs in rows, people in columns)
      People →
S1    A A   A C   C C   C C
S2    A C   A A   A C   A A
S3    C G   C G   G G   G G
S4    T T   C G   T T   G T
S5    A A   G T   A A   A A
S6    T T   A C   T T   T T
← SNPs
People information:
P1 …
P2 …
P3 …
P4 …
P5 …
Compact binary format
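As a rough illustration of the two text layouts, the Python sketch below parses PED-style lines and transposes them so that each row is a SNP. The genotypes are copied from the PED table on the slide. The six leading columns (family ID, individual ID, father, mother, sex, phenotype) hold made-up values, and the function names are hypothetical. On the slide, the second person's column in rows S4 to S6 of the transposed table does not match the PED table, so the output here follows the PED rows.

```python
# Genotypes from the slide's PED table; the six leading columns are invented for illustration.
PED_LINES = [
    "F1 P1 0 0 1 1  A A  A C  C G  T T  A A  T T",
    "F2 P2 0 0 2 1  A C  A A  C G  G T  A C  T T",
    "F3 P3 0 0 1 2  C C  A C  G G  T T  A A  T T",
    "F4 P4 0 0 2 2  C C  A A  G G  G T  A A  T T",
]


def parse_ped(lines):
    """Return (individual IDs, genotypes[person][snp] as (allele1, allele2))."""
    ids, genotypes = [], []
    for line in lines:
        fields = line.split()
        ids.append(fields[1])          # second column is the individual ID
        alleles = fields[6:]           # two alleles per SNP after six leading columns
        genotypes.append(
            [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
        )
    return ids, genotypes


def transpose(genotypes):
    """Turn people-by-SNP genotypes into SNP-by-people genotypes."""
    return [list(snp) for snp in zip(*genotypes)]


ids, genotypes = parse_ped(PED_LINES)
by_snp = transpose(genotypes)
for n, row in enumerate(by_snp, start=1):
    print(f"S{n}", "  ".join(a + " " + b for a, b in row))
```

The people-by-SNP layout is convenient for per-individual summaries, while the transposed layout makes per-SNP summaries a simple row scan.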
6 Data management
Recode dataset (A,C,G,T → 1,2)
Reorder, reformat dataset
Flip DNA strand
Extract/remove individuals/SNPs
New phenotypes, covariates as extra file
Merge 2 or more data sets
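To make the "flip DNA strand" step concrete, here is a minimal Python sketch of what flipping means for the genotype calls: each allele is replaced by its complement (A with T, C with G) for the selected SNPs only. The function names and toy data are hypothetical, and this only illustrates the idea rather than PLINK's own code.

```python
# Complement map for strand flipping; "0" is PLINK's default missing-allele code.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "0": "0"}


def flip_strand(genotype):
    """Return the genotype as read from the opposite DNA strand."""
    return tuple(COMPLEMENT[allele] for allele in genotype)


def flip_snps(genotypes, snp_indices):
    """Flip the given SNP columns for every person; other columns are untouched."""
    to_flip = set(snp_indices)
    return [
        [flip_strand(g) if i in to_flip else g for i, g in enumerate(person)]
        for person in genotypes
    ]


# Toy data: two people, two SNPs; only the first SNP is flipped.
people = [
    [("A", "C"), ("G", "G")],
    [("C", "C"), ("G", "T")],
]
flipped = flip_snps(people, [0])
print(flipped)
```

Flipping is typically needed before merging data sets that were genotyped against different strands.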
7 Summary and QC
Hardy-Weinberg test
Mendel errors
Missing genotypes
Allele frequencies
Tests of non-random missingness, by phenotype and by (unobserved) genotype
Sex check
Pairwise IBD estimates
8 Hardy-Weinberg test
plink --file data --hardy
An exact test by default.
In a case-control study, the control group typically needs a more lenient threshold (e.g., P-value < 1e-3).
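The slide notes that PLINK uses an exact test by default. As a simpler stand-in that shows what is being tested, the sketch below uses the textbook chi-square goodness-of-fit test (1 degree of freedom) on genotype counts. It is not the exact test, and the function name and counts are invented for illustration.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions from genotype counts.

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    This is the asymptotic test, not the exact test PLINK uses by default.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotyped individuals")
    p = (2 * n_aa + n_ab) / (2 * n)       # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


chi2, p_value = hwe_chisq(30, 40, 30)
print(chi2, p_value)
```

With 30, 40, and 30 individuals in the three genotype classes, the allele frequency is 0.5, so 25, 50, and 25 are expected; the shortage of heterozygotes drives the statistic.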
9 Mendel errors
plink --file data --mendel
A genotyping error is flagged when the child’s genotype could not have been inherited from the parents according to Mendel’s laws.
Outputs the error rate for each SNP and each individual.
Code   Pat , Mat -> Offspring
1      AA , AA -> AB
2      BB , BB -> AB
3      BB , ** -> AA
4      ** , BB -> AA
5      BB , BB -> AA
6      AA , ** -> BB
7      ** , AA -> BB
8      AA , AA -> BB
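The rule behind the error table can be sketched in a few lines of Python: for an autosomal SNP, a child's genotype is consistent if it can be formed from one paternal and one maternal allele. The function names and toy trios are hypothetical, and the sketch assumes no missing genotypes.

```python
from itertools import product


def is_mendel_consistent(father, mother, child):
    """True if the child could carry one allele from each parent (autosomal SNP)."""
    child_sorted = sorted(child)
    return any(
        sorted((pat, mat)) == child_sorted for pat, mat in product(father, mother)
    )


def mendel_error_rate(trios):
    """Fraction of (father, mother, child) genotype trios that break the rule."""
    errors = sum(not is_mendel_consistent(f, m, c) for f, m, c in trios)
    return errors / len(trios)


# Toy trios at one SNP: (father, mother, child).
trios = [
    (("A", "A"), ("A", "A"), ("A", "B")),  # error: allele B has no source
    (("A", "B"), ("A", "A"), ("A", "B")),  # consistent
    (("B", "B"), ("A", "B"), ("A", "A")),  # error: father can only pass B
    (("A", "B"), ("A", "B"), ("B", "B")),  # consistent
]
print(mendel_error_rate(trios))
```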
10 Missingness and Allele Frequency
plink --file data --missing
Outputs the missing rate per SNP and per individual.
plink --file data --freq
Outputs each SNP’s allele frequency.
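The quantities reported by these two commands are simple proportions. The Python sketch below computes them on a toy data set, assuming genotypes are stored as allele pairs with "0" marking a missing call (PLINK's default missing code). The function names and data are invented for illustration.

```python
from collections import Counter

MISSING = "0"  # PLINK's default missing-allele code


def snp_missing_rate(genotypes, snp):
    """Share of individuals with a missing call at the given SNP index."""
    calls = [person[snp] for person in genotypes]
    return sum(MISSING in g for g in calls) / len(calls)


def individual_missing_rate(person):
    """Share of SNPs with a missing call for one individual."""
    return sum(MISSING in g for g in person) / len(person)


def allele_frequencies(genotypes, snp):
    """Allele frequencies at one SNP, computed over non-missing calls only."""
    counts = Counter(
        allele
        for person in genotypes
        for allele in person[snp]
        if MISSING not in person[snp]
    )
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


# Toy data: four people, two SNPs; the first person is missing the second SNP.
genos = [
    [("A", "A"), ("0", "0")],
    [("A", "C"), ("G", "G")],
    [("C", "C"), ("G", "T")],
    [("A", "A"), ("G", "G")],
]
print(snp_missing_rate(genos, 1), allele_frequencies(genos, 0))
```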
11 Is the missingness random?
plink --file data --test-missing
Tests whether a SNP’s missingness differs between cases and controls.
plink --file data --test-mishap
Tests whether a SNP’s missingness depends on the observed genotypes at nearby SNPs.
Assumes dense SNP genotyping.
Uses haplotype and LD information in the tests.
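One way to picture the case/control missingness test is as a 2x2 table (cases vs. controls, missing vs. genotyped) analysed with Fisher's exact test. The sketch below implements a two-sided Fisher test with the standard library only. The function name and counts are hypothetical, and it illustrates the idea rather than reproducing PLINK's code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: cases, controls. Columns: missing, genotyped.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell with fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more likely than the observed one.
    total = sum(
        prob(k) for k in range(lowest, highest + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Toy counts: 3 of 4 cases missing, 1 of 4 controls missing.
p_missing = fisher_exact_two_sided(3, 1, 1, 3)
print(p_missing)
```

A small p-value here would suggest that missingness is tied to phenotype, which can create spurious association signals.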
12 Sex Check
plink --file data --check-sex
Uses X-chromosome heterozygosity rates to infer sex, then compares the result with the sex recorded in the data.
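The intuition is that males carry a single X chromosome, so their X-chromosome genotypes should almost never appear heterozygous. The Python sketch below computes a raw heterozygosity rate and flags a mismatch with the recorded sex. The cutoff `max_male_het` is a made-up value for illustration, the function names are hypothetical, and PLINK's own statistic is, to my understanding, an X-chromosome inbreeding estimate rather than this raw rate.

```python
def x_het_rate(x_genotypes, missing="0"):
    """Share of non-missing X-chromosome genotypes that are heterozygous."""
    called = [g for g in x_genotypes if missing not in g]
    if not called:
        return None
    return sum(a != b for a, b in called) / len(called)


def flag_sex_mismatch(x_genotypes, recorded_sex, max_male_het=0.05):
    """Return True if the X heterozygosity disagrees with the recorded sex.

    recorded_sex uses the PED coding: 1 = male, 2 = female.
    max_male_het is an arbitrary illustrative cutoff, not a PLINK default.
    """
    rate = x_het_rate(x_genotypes)
    if rate is None:
        return False  # nothing to judge from
    looks_male = rate <= max_male_het
    return looks_male != (recorded_sex == 1)


homozygous_x = [("A", "A"), ("C", "C"), ("0", "0"), ("G", "G")]
mixed_x = [("A", "C"), ("C", "C"), ("G", "T"), ("A", "A")]
print(x_het_rate(homozygous_x), x_het_rate(mixed_x))
```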
13 Pairwise IBD sharing (relatedness)
[Figure: pedigree diagram with the labels "Most recent common ancestor from homogeneous random mating population" and "Parents"; genotypes AB and AC; IBS = 1, IBD = 0]
PLINK tutorial, October 2006; Shaun Purcell
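Identity by state (IBS) can be counted directly from two genotypes, whereas identity by descent (IBD) has to be estimated from many SNPs. The Python sketch below covers only the IBS part: it counts shared alleles for a genotype pair, as in the AB versus AC example on the slide, and averages that over SNPs. The function names are hypothetical, and missing genotypes are assumed to be absent.

```python
from collections import Counter


def ibs_count(g1, g2):
    """Number of alleles (0, 1, or 2) that two genotypes share identical by state."""
    return sum((Counter(g1) & Counter(g2)).values())


def mean_ibs_similarity(person1, person2):
    """Average of IBS / 2 across SNPs: 0 means nothing shared, 1 means identical."""
    pairs = list(zip(person1, person2))
    return sum(ibs_count(a, b) for a, b in pairs) / (2 * len(pairs))


# The slide's example pair: AB and AC share one allele by state.
print(ibs_count(("A", "B"), ("A", "C")))

# First two people from the PED table on the data format slide.
p1 = [("A", "A"), ("A", "C"), ("C", "G"), ("T", "T"), ("A", "A"), ("T", "T")]
p2 = [("A", "C"), ("A", "A"), ("C", "G"), ("G", "T"), ("A", "C"), ("T", "T")]
print(mean_ibs_similarity(p1, p2))
```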
14 Relatedness Check
plink --file data --genome
Uses genome-wide information, but typically does not need every genotyped SNP.
Typically 100K independent SNPs are enough.
19 Cardinal rules in PLINK
Always consult the log file and console output.
Also consult the web documentation regularly.
PLINK has no memory: each run loads the data anew, and previous filters are lost.
Exact syntax and spelling are important: “minus minus” …
PLINK tutorial, October 2006; Shaun Purcell