Introduction to Data Formats and tools


1 Introduction to Data Formats and tools
Data formats for GWAS and Plink
Shaun Aron
Sydney Brenner Institute for Molecular Bioscience, University of the Witwatersrand

2 Measure intensities -> Genotype calling -> Variant QC -> Sample QC -> Association

3 Tools and files along the workflow (diagram)
Genotype calling tools: Genome Studio, genCall, zCall
Variant and sample QC: Plink, EigenSoft, R
Association testing: Plink, GEMMA
Imputation tools: Impute2, PBWT, online
File types: iDAT, genotype reports, Plink

4 Genotype calling

5 Genotype Calls to plink
Some genotyping software exports data directly in Plink format.
Tools and scripts are available to convert genotyping reports to Plink format, or you can write the conversion yourself.

6 Plink
Plink is a standard tool for manipulating and analyzing genotype data
Plink works with standard data formats
Has the functionality to convert between different formats
Developed and optimized for working with biallelic SNP data
Plink online manual

7 Data formats
PED format
PED file: sample/individual information
MAP file: SNV information
The PED file has no header; its first 6 columns are fixed:
Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1=male, 2=female, other=unknown)
Phenotype (missing -9, control 1, case 2, or a quantitative trait)
These are followed by the allele calls for each variant, two columns per variant. Different allele encodings are possible.

8 PED Format
FID  IID  PATID  MATID  SEX  PHENO  Alleles for SNP 1  Alleles for SNP 2

9 MAP File
Chr  SNP ID  Genetic Distance (morgans)  BP Position
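The PED and MAP layouts above can be illustrated with a short parsing sketch. This is not part of Plink; the sample IDs, SNP IDs and the helper names parse_map and parse_ped are invented for illustration, and the sketch assumes whitespace-separated files with "0" as the missing allele code.

```python
# Toy PED and MAP content; family, individual and SNP names are invented.
PED_TEXT = """FAM1 IND1 0 0 1 2 A A G T
FAM1 IND2 0 0 2 1 A C 0 0
"""
MAP_TEXT = """1 rs0001 0 1000
1 rs0002 0 2000
"""


def parse_map(text):
    """Return one dict per variant: chromosome, SNP ID, genetic distance, bp position."""
    variants = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, snp_id, gen_dist, bp = line.split()
        variants.append(
            {"chr": chrom, "snp": snp_id, "gen_dist": float(gen_dist), "bp": int(bp)}
        )
    return variants


def parse_ped(text, n_snps):
    """Split each PED row into the 6 fixed columns plus one allele pair per SNP."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        fid, iid, pat, mat, sex, pheno = fields[:6]
        alleles = fields[6:]
        if len(alleles) != 2 * n_snps:
            raise ValueError(f"{iid}: expected {2 * n_snps} alleles, got {len(alleles)}")
        # Alleles come in pairs: columns 7-8 are SNP 1, columns 9-10 are SNP 2, ...
        genotypes = [(alleles[i], alleles[i + 1]) for i in range(0, len(alleles), 2)]
        samples.append(
            {"fid": fid, "iid": iid, "pat": pat, "mat": mat,
             "sex": sex, "pheno": pheno, "genotypes": genotypes}
        )
    return samples


variants = parse_map(MAP_TEXT)
samples = parse_ped(PED_TEXT, len(variants))
for sample in samples:
    print(sample["iid"], sample["genotypes"])
```

The number of allele columns in the PED file must be exactly twice the number of rows in the MAP file, which is why the sketch checks it.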

10 Data Storage
Storing thousands of samples and millions of SNPs in plain text requires a large amount of disk space
Plink therefore prefers to work with binary versions of the PED files
The binary format compresses the PED and MAP files and reduces their size

11 Binary PED format
FAM file: one row per individual, the first 6 columns of the PED file
BIM file: one row per SNP, the MAP file plus the two alleles for that SNP
BED file: genotype calls for every individual at every SNP, i.e. the rest of the PED file in binary format
The FAM and BIM files are human readable, while the BED file is not

12 FAM File
FID  IID  PATID  MATID  SEX  PHENO
BIM File
Chr  SNP ID  Genetic Distance (morgans)  BP Position  SNP Alleles
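To make concrete why the BED file is not human readable, here is a minimal decoding sketch. It assumes the standard SNP-major layout of a Plink 1 .bed file: three header bytes, then each SNP stored as a block of bytes holding four 2-bit genotype codes per byte, lowest bits first. The function name decode_bed and the toy bytes are invented for illustration; in practice the sample and SNP counts come from the FAM and BIM files.

```python
# Header of a SNP-major .bed file: two magic bytes, then 0x01 for SNP-major mode.
MAGIC = bytes([0x6C, 0x1B, 0x01])

# 2-bit code -> number of copies of the first BIM allele (None = missing call).
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_bed(data, n_samples, n_snps):
    """Return a list with one row per SNP, each holding one call per sample."""
    if data[:3] != MAGIC:
        raise ValueError("not a SNP-major .bed file")
    bytes_per_snp = (n_samples + 3) // 4  # 4 genotypes per byte, rounded up
    body = data[3:]
    if len(body) != bytes_per_snp * n_snps:
        raise ValueError("file size does not match sample and SNP counts")
    calls = []
    for snp_index in range(n_snps):
        block = body[snp_index * bytes_per_snp:(snp_index + 1) * bytes_per_snp]
        row = []
        for sample_index in range(n_samples):
            byte = block[sample_index // 4]
            # Each sample occupies 2 bits, starting from the low-order end.
            code = (byte >> (2 * (sample_index % 4))) & 0b11
            row.append(CODE_TO_A1_COUNT[code])
        calls.append(row)
    return calls


# One SNP, five samples: hom A1, het, hom A2, missing, hom A1.
toy_bed = MAGIC + bytes([0x78, 0x00])
decoded = decode_bed(toy_bed, n_samples=5, n_snps=1)
print(decoded)
```

Five genotypes fit in two bytes here, where the PED representation needs ten allele columns of text.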

13 Other formats Plink takes in various other data formats
Able to convert from other formats into Plink format

14 Plink basics

15 Plink basics
Command line based
Call Plink using the plink command

16 Plink basics
Flags are used for different operations
E.g. --file tells Plink the prefix of the input files and that they are in PED format
E.g. --file hapmap1
In your current directory you should have your data in PED format: hapmap1.map, hapmap1.ped
Try it now

17 Plink basics
Output files have the prefix plink by default. Use the --out flag to specify your own name.
If you want to explicitly convert to binary format, use the --make-bed flag.

18 Plink Basics
Examine your newly generated files
Identify what each row and column denotes
Remember that you cannot usefully open the .bed file in a text editor: it is not human readable
If you are reading in a file in binary PED format, use the --bfile flag
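The flags introduced so far can be combined as in the sketch below, which builds the command lines in Python and only runs them if plink and the hapmap1 files are actually present. The output prefixes hapmap1_bin and hapmap1_freq are arbitrary names chosen for this example, not names used in the exercises.

```python
import os
import shutil
import subprocess

# Read hapmap1.ped / hapmap1.map and write hapmap1_bin.bed / .bim / .fam.
convert_cmd = ["plink", "--file", "hapmap1", "--make-bed", "--out", "hapmap1_bin"]

# Read the binary fileset back in with --bfile and compute allele frequencies.
read_binary_cmd = ["plink", "--bfile", "hapmap1_bin", "--freq", "--out", "hapmap1_freq"]

for cmd in (convert_cmd, read_binary_cmd):
    print(" ".join(cmd))

# Only execute when plink is installed and the input files exist.
have_plink = shutil.which("plink") is not None
have_input = os.path.exists("hapmap1.ped") and os.path.exists("hapmap1.map")
if have_plink and have_input:
    subprocess.run(convert_cmd, check=True)
    subprocess.run(read_binary_cmd, check=True)
```

Typing the printed lines directly at the shell prompt is equivalent; the Python wrapper is only there to keep the example self-contained.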

19 Run through exercise 2

20 Plink Commands
Command            Action
--recode           Transform between formats
--freq             Generate simple statistics (allele frequencies)
--vcf              Read in file in VCF format
--keep [file]      Retain samples in the specified file
--remove [file]    Remove samples in the specified file
--extract [file]   Keep SNPs in the specified file
--exclude [file]   Remove SNPs in the specified file
--pheno [file]     Read phenotypes from the specified file

21 Plink Filtering
You may need to extract specific parts of a complete dataset: specific SNPs or specific individuals
SNPs or individuals can be selected either directly on the command line or by using a file with a specific format

22 Plink Filtering
For individual filtering, create a file with the FID and IID of the individuals you want to keep or remove.
For SNP filtering, create a file with the SNP IDs you would like to extract or exclude.
In both cases you would most likely generate a new dataset.

23 Plink Filtering
Sample file: --keep, --remove
SNP file: --extract, --exclude
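A minimal sketch of preparing the two kinds of filter file follows: a sample file with one FID and IID per line, and a SNP file with one SNP ID per line. All IDs, file names and the output prefix hapmap1_subset are hypothetical; the command is only printed, not run.

```python
import os
import tempfile

# Hypothetical samples (FID, IID) to keep and SNP IDs to extract.
keep_samples = [("FAM1", "IND1"), ("FAM2", "IND7")]
extract_snps = ["rs0001", "rs0002"]


def write_sample_list(path, sample_ids):
    """One 'FID IID' pair per line, as used by --keep and --remove."""
    with open(path, "w") as handle:
        for fid, iid in sample_ids:
            handle.write(f"{fid} {iid}\n")


def write_snp_list(path, snp_ids):
    """One SNP ID per line, as used by --extract and --exclude."""
    with open(path, "w") as handle:
        for snp_id in snp_ids:
            handle.write(f"{snp_id}\n")


workdir = tempfile.mkdtemp()
keep_path = os.path.join(workdir, "keep_samples.txt")
snp_path = os.path.join(workdir, "extract_snps.txt")
write_sample_list(keep_path, keep_samples)
write_snp_list(snp_path, extract_snps)

# Filtering normally writes a new dataset, hence --make-bed and --out.
filter_cmd = ["plink", "--bfile", "hapmap1", "--keep", keep_path,
              "--extract", snp_path, "--make-bed", "--out", "hapmap1_subset"]
print(" ".join(filter_cmd))
```

Swapping --keep for --remove, or --extract for --exclude, reuses the same file formats with the opposite effect.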

24 Pheno File
Phenotypes can be stored in the PED or FAM file
In some instances it is useful to store them in a separate file
Use the --pheno flag followed by a file with the following format:
FID  IID  PHENO
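As a sketch of how a separate phenotype file relates to the samples in a dataset, the snippet below reads FID, IID and PHENO columns and looks each sample up by its (FID, IID) pair. The IDs and helper names are invented, -9 is treated as missing as on the PED format slide, and a sample absent from the file is also given a missing phenotype here, which is an assumption of this sketch.

```python
# Hypothetical phenotype file contents: FID IID PHENO, no header.
PHENO_TEXT = """FAM1 IND1 2
FAM1 IND2 1
FAM2 IND7 -9
"""


def parse_pheno(text):
    """Map (FID, IID) to the phenotype string, with -9 stored as None (missing)."""
    phenotypes = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        fid, iid, value = fields[:3]
        phenotypes[(fid, iid)] = None if value == "-9" else value
    return phenotypes


def attach_phenotypes(sample_ids, phenotypes):
    """Pair each (FID, IID) with its phenotype; unknown samples get None."""
    return [(fid, iid, phenotypes.get((fid, iid))) for fid, iid in sample_ids]


dataset_samples = [("FAM1", "IND1"), ("FAM2", "IND7"), ("FAM3", "IND9")]
annotated = attach_phenotypes(dataset_samples, parse_pheno(PHENO_TEXT))
for row in annotated:
    print(row)
```

Both ID columns are needed for the lookup, since an IID is only guaranteed to be unique within its family.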

25 Plink Filtering
Another useful flag is the --filter flag
It uses the same file format as the phenotype file
Plink also has some built-in filtering flags:
--filter-cases
--filter-controls
--filter-males
--filter-females

26 Run through exercise 3.1 – 3.11

27 Selection based on Criteria
Flags are defined to select samples/SNPs based on specific criteria
You will come across these again in the QC section of the course

28 Plink filtering
Command              Action
--hwe [threshold]    Remove variants with HWE p < threshold
--missing            Compute per-sample and per-variant missingness
--check-sex          Check reported sex against sex inferred from X chr genotypes
--genome             Compute relatedness based on IBD
--maf [threshold]    Remove variants with a MAF below threshold
--mind [value]       Remove individuals with missing data above value
--geno [value]       Remove SNPs with missing data above value

29 Criteria Selection Flags
--mind 0.02: all individuals with more than 2% missing data are removed
--geno 0.04: all SNPs with a call rate of less than 96% are removed
--maf 0.01: all SNPs with a minor allele frequency of less than 1% are removed
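The quantities behind these three flags can be computed by hand on a toy genotype matrix, as sketched below. This is an illustration, not Plink's implementation: the data are invented, the thresholds are much looser than 0.02, 0.04 and 0.01 because five samples cannot resolve such small rates, and each filter is evaluated independently on the full matrix rather than in the order Plink applies them.

```python
# Rows are samples, columns are SNPs; values count copies of one allele,
# and None marks a missing genotype call. All values are invented.
GENOTYPES = [
    [0, 1, 2, 0],
    [0, None, 2, 1],
    [1, 1, 2, 0],
    [0, 0, None, None],
    [0, 2, 2, 0],
]


def missing_rate(calls):
    """Fraction of calls that are missing."""
    return sum(call is None for call in calls) / len(calls)


def minor_allele_freq(calls):
    """Frequency of the rarer allele among non-missing calls."""
    called = [call for call in calls if call is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1 - freq)


MIND, GENO, MAF = 0.3, 0.1, 0.05  # toy thresholds for this tiny example

snp_columns = list(zip(*GENOTYPES))
samples_removed = [i for i, row in enumerate(GENOTYPES) if missing_rate(row) > MIND]
snps_failing_geno = [j for j, col in enumerate(snp_columns) if missing_rate(col) > GENO]
snps_failing_maf = [j for j, col in enumerate(snp_columns) if minor_allele_freq(col) < MAF]

print("samples removed by --mind:", samples_removed)
print("SNPs removed by --geno:", snps_failing_geno)
print("SNPs removed by --maf:", snps_failing_maf)
```

Call rate is simply one minus the missing rate, so a --geno value of 0.04 and a 96% call-rate cutoff describe the same filter.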

30 Go through exercise 3.11 – 3.22

31 Merging Datasets
Plink has built-in tools for merging datasets
Merging is not a straightforward process, but it is useful for population studies
Datasets need to be from the same genome build, have SNPs called on the same strand, etc.
Section 4 deals with how to merge data successfully

32 Association Testing
Plink provides a number of association testing approaches
--assoc (case/control phenotype): assumes there are case/control values in the phenotype column of your PED/FAM file or in a specified phenotype file, and runs a simple chi-squared association test
--assoc (quantitative phenotype): assumes there are quantitative values in the phenotype column of your PED/FAM file or in a specified phenotype file, and runs a regression analysis
--linear: runs a linear regression for a quantitative trait, allowing for the inclusion of covariates
Additional options adjust for multiple testing and run permutation testing
This will be covered in more detail in the course
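For intuition about the simple chi-squared test on case/control data, here is a sketch of a 1-degree-of-freedom allelic chi-squared test on a 2x2 table of allele counts. It illustrates the kind of test described above and is not claimed to reproduce Plink's output exactly; the allele counts are invented.

```python
import math


def allelic_chi_squared(case_a1, case_a2, control_a1, control_a2):
    """Chi-squared statistic and p-value (1 df) for a 2x2 allele count table."""
    a, b, c, d = case_a1, case_a2, control_a1, control_a2
    total = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("a row or column of the table is empty")
    chi2 = total * (a * d - b * c) ** 2 / denominator
    # Upper tail of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# 50 cases and 50 controls, i.e. 100 alleles per group (invented counts).
chi2, p_value = allelic_chi_squared(case_a1=60, case_a2=40, control_a1=40, control_a2=60)
print(f"chi2 = {chi2:.3f}, p = {p_value:.5f}")
```

In a genome-wide scan this test is repeated for every SNP, which is why the multiple-testing adjustments mentioned above matter.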

33 Run through sections 4 and 5

