ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.


1 ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows

2 Objective
– Parse VCF files
– Calculate summary statistics across sliding windows throughout the genome
– Implement NTFreq module to calculate nucleotide frequencies for each population and for the combined population
– Implement TajimasD module to calculate Tajima’s D
– Implement GO module to annotate identified SNPs

3 Data set
– Simulated data set for chromosome 2R in Drosophila melanogaster
  – 1.4 Mbp
  – 2 populations
    – Pooled individuals per population
  – 75 bp reads, error rate 1%
  – 10,000 simulated SNPs
    – 100x coverage per variant
    – At least 100 bp apart
    – Allelic frequencies ranging from 0.1 to 0.9 per population

4 Data to Variant Call Format
– Index reference genome
  – Only chromosome 2R of D. melanogaster (genome build Dmel 3 from FlyBase)
– Use BWA to align FastQ to reference genome
  – Gap open penalty = 1
  – Disallowing deletion within 12 bp of the 3’ end
  – Gap extension max = 12 (maximum level of gap extensions = 12)
– Use SAMtools to remove ambiguously mapped reads (keeping MAPQ >= 20)
– Use BCFtools mpileup to generate a Binary Call Format (BCF) file
– BCF -> VCF
– FastQ -> sai -> SAM -> BAM -> .bcf -> VCF
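The FastQ -> VCF chain on this slide can be sketched as a list of command lines. This is a hedged sketch, not the authors' script: the file names are hypothetical, recent SAMtools/BCFtools releases are assumed, and mapping the slide's parameters onto `bwa aln -o 1 -e 12 -d 12` is an assumption (in BWA, `-o` is the maximum number of gap opens, `-e` the maximum number of gap extensions, and `-d` disallows long deletions near the 3' end). The code only builds the commands; it does not run them.

```python
def build_pipeline(ref="2R.fa", fastq="pop1.fq", prefix="pop1"):
    """Return shell commands for FastQ -> sai -> SAM -> BAM -> BCF -> VCF.

    All file names are hypothetical placeholders.
    """
    sai, sam, bam = f"{prefix}.sai", f"{prefix}.sam", f"{prefix}.bam"
    bcf, vcf = f"{prefix}.bcf", f"{prefix}.vcf"
    return [
        f"bwa index {ref}",                                   # index the reference
        f"bwa aln -o 1 -e 12 -d 12 {ref} {fastq} > {sai}",    # align reads (assumed flags)
        f"bwa samse {ref} {sai} {fastq} > {sam}",             # sai -> SAM (single-end)
        f"samtools view -b -q 20 {sam} | samtools sort -o {bam} -",  # keep MAPQ >= 20
        f"bcftools mpileup -Ob -f {ref} -o {bcf} {bam}",      # BAM -> BCF
        f"bcftools view -Ov -o {vcf} {bcf}",                  # BCF -> VCF
    ]


for command in build_pipeline():
    print(command)
```

Running the same function once per population (for example with `prefix="pop2"`) would give one BCF/VCF pair per pool.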

5 Formatting data: Parse VCF
For each window:
– Fetch the VCF rows from each BCF file
– Convert the VCF rows into hashes of arrays
– Compute Theta, Pi, and Tajima’s D for each population
– Compute Fst for each window between the populations
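The "hashes of arrays" step might look like the sketch below, with a Python dict standing in for a Perl hash. The choice of stored fields and the example rows are assumptions made for illustration, not the pipeline's actual layout.

```python
def parse_vcf_rows(lines):
    """Map each SNP position to [chrom, ref, alt, info] (a "hash of arrays")."""
    snps = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and VCF header/meta lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        info = fields[7] if len(fields) > 7 else ""
        snps[int(pos)] = [chrom, ref, alt, info]
    return snps


def rows_in_window(snps, start, end):
    """Return the SNP positions inside the 1-based inclusive window."""
    return sorted(pos for pos in snps if start <= pos <= end)


# Hypothetical VCF rows for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "2R\t1080105\t.\tA\tG\t50\t.\tDP=100",
    "2R\t1100500\t.\tC\tT\t48\t.\tDP=98",
]
snps = parse_vcf_rows(example)
print(rows_in_window(snps, 1080001, 1100000))  # [1080105]
```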

6 Sliding windows
– A sliding window size is specified, and each called module is calculated across windows of that size
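A minimal sketch of window generation, assuming 1-based inclusive coordinates. The 20,000 bp window size is inferred from the window "1080001 - 1100000" shown later in the deck, and the `step` parameter is a hypothetical addition for overlapping windows.

```python
def sliding_windows(chrom_length, window_size, step=None):
    """Yield 1-based inclusive (start, end) windows along a chromosome.

    With step == window_size the windows do not overlap; a smaller
    step gives overlapping windows.
    """
    if step is None:
        step = window_size
    start = 1
    while start <= chrom_length:
        yield start, min(start + window_size - 1, chrom_length)
        start += step


windows = list(sliding_windows(chrom_length=1_400_000, window_size=20_000))
print(len(windows))  # 70
print(windows[54])   # (1080001, 1100000)
```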

7 Module 1: Calculate allele frequencies
– Input is taken from the parsed VCF file
– Hashes are created for each population with the following structure:
  – {SNP_location}{nucleotide} -> frequency
– Hashes are created for the full data set:
  – {SNP_location}{Population} -> {nucleotide} -> frequency
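The two hash layouts above can be mirrored with nested Python dicts. The base calls below are invented, and counting bases from a string is only one assumed way the frequencies could be obtained from the parsed VCF.

```python
from collections import Counter


def allele_frequencies(bases):
    """Return {nucleotide: frequency} for the bases observed at one SNP."""
    counts = Counter(bases)
    total = sum(counts.values())
    return {nt: n / total for nt, n in counts.items()}


# Hypothetical pooled base calls per population at two SNP positions.
observed = {
    "pop1": {1080105: "AAAAAAAAAG", 1080320: "CCCCCTTTTT"},
    "pop2": {1080105: "AGGGGGGGGG", 1080320: "CCCCCCCCTT"},
}

# Per-population hash: {SNP_location}{nucleotide} -> frequency
per_pop = {
    pop: {pos: allele_frequencies(bases) for pos, bases in sites.items()}
    for pop, sites in observed.items()
}

# Full-dataset hash: {SNP_location}{Population} -> {nucleotide} -> frequency
combined = {}
for pop, sites in per_pop.items():
    for pos, freqs in sites.items():
        combined.setdefault(pos, {})[pop] = freqs

print(per_pop["pop1"][1080105])        # {'A': 0.9, 'G': 0.1}
print(combined[1080105]["pop2"]["G"])  # 0.9
```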

8 Output site frequency spectra
– Site frequency spectrum (SFS) output as the following hash:
  – {nonref_allele}{frequency} -> count
– Allows us to calculate a histogram of the non-reference allele frequencies
– Send output to R to generate SFS graphs
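A sketch of the SFS hash, again with dicts in place of Perl hashes. Rounding frequencies to one decimal so they can serve as keys is an assumption; the deck does not say how frequencies are binned.

```python
from collections import defaultdict


def site_frequency_spectrum(sites, ndigits=1):
    """Build {nonref_allele: {frequency_bin: count}} from (allele, frequency) pairs.

    Frequencies are rounded to `ndigits` decimals so they can act as hash keys.
    """
    sfs = defaultdict(lambda: defaultdict(int))
    for allele, freq in sites:
        sfs[allele][round(freq, ndigits)] += 1
    return {allele: dict(bins) for allele, bins in sfs.items()}


# Hypothetical non-reference alleles and their frequencies in one population.
sites = [("G", 0.10), ("G", 0.12), ("T", 0.50), ("G", 0.90), ("T", 0.52)]
sfs = site_frequency_spectrum(sites)
print(sfs)  # {'G': {0.1: 2, 0.9: 1}, 'T': {0.5: 2}}
```

The per-bin counts are what a histogram in R would plot.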

9 Module 2: Calculate Summary Statistics and Tajima’s D
– theta_pi (index of diversity)
– theta_watterson (index of diversity)
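The slide's formulas did not survive extraction, so the sketch below uses the standard textbook estimators for a sample of n chromosomes. It assumes biallelic SNPs and individual-level allele counts; pooled data like the deck's would normally need extra corrections that are not shown here.

```python
def theta_watterson(num_segregating, n):
    """Watterson's estimator: S / a1, with a1 = sum_{i=1}^{n-1} 1/i."""
    a1 = sum(1.0 / i for i in range(1, n))
    return num_segregating / a1


def theta_pi(alt_counts, n):
    """Mean number of pairwise differences, summed over biallelic sites.

    alt_counts holds, per site, how many of the n sampled chromosomes
    carry the non-reference allele.
    """
    pairs = n * (n - 1) / 2
    return sum(k * (n - k) / pairs for k in alt_counts)


alt_counts = [1, 2, 5]  # hypothetical window with 3 SNPs
n = 10                  # hypothetical number of sampled chromosomes
print(round(theta_watterson(len(alt_counts), n), 4))  # 1.0605
print(round(theta_pi(alt_counts, n), 4))              # 1.1111
```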

10 Module 2: Calculate Summary Statistics and Tajima’s D
– Tajima’s D (index of selection/population expansion)
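Tajima's D compares theta_pi with theta_watterson and scales the difference by its expected standard deviation. The sketch below follows the constants from Tajima (1989); the function name and the choice to return `None` for uninformative windows are assumptions, not the TajimasD module's actual interface.

```python
import math


def tajimas_d(pi, num_segregating, n):
    """Tajima's D from mean pairwise differences (pi), segregating sites S, sample size n."""
    s = num_segregating
    if s == 0 or n < 4:
        return None  # D is undefined without SNPs or with too few samples
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n**2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return None
    return (pi - s / a1) / math.sqrt(variance)


# Commonly quoted worked example: n = 10, S = 16, pi = 3.888889
print(round(tajimas_d(3.888889, 16, 10), 3))  # -1.446
```

A negative D means an excess of rare variants (theta_pi below theta_watterson); a positive D means an excess of intermediate-frequency variants.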

11 Module 3: Fst for DNA sequence
– Calculate Fst (index of differentiation) according to Hudson et al. 1992
– Fst = 1 - Hw/Hb
  – Hw: average number of differences within each population
  – Hb: average number of differences between the 2 populations
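For biallelic SNPs, Hw and Hb can be written in terms of allele frequencies. The sketch below sums both over the SNPs of one window before taking the ratio and applies no small-sample correction; both choices are simplifying assumptions rather than the module's documented behaviour.

```python
def hudson_fst(freqs_pop1, freqs_pop2):
    """Fst = 1 - Hw/Hb over the biallelic SNPs of one window.

    freqs_pop1 / freqs_pop2 hold the non-reference allele frequency of each
    SNP in the two populations. No small-sample correction is applied.
    """
    hw_total = 0.0
    hb_total = 0.0
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        within1 = 2 * p1 * (1 - p1)
        within2 = 2 * p2 * (1 - p2)
        hw_total += (within1 + within2) / 2        # differences within populations
        hb_total += p1 * (1 - p2) + p2 * (1 - p1)  # differences between populations
    if hb_total == 0:
        return None  # no variation between populations in this window
    return 1 - hw_total / hb_total


# One SNP at the extremes of the simulated frequency range (0.1 vs 0.9).
print(round(hudson_fst([0.1], [0.9]), 4))  # 0.7805
```

Identical frequencies in both populations give Hw equal to Hb and therefore Fst = 0.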

12 Module 4: GO annotations
– Module takes a SNP list as input
– Outputs the following:
  – List of genes that overlap SNP positions
  – Gene Ontology (GO) IDs and terms associated with each SNP-matched gene
  – List of genes for a selected window
– Visualization using GOSlim
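The overlap step can be sketched as an interval lookup. The gene names and coordinates below are invented, and pairing them with GO IDs from the deck's results table is purely illustrative; a real run would read gene models and GO associations from annotation files.

```python
def annotate_snps(snp_positions, genes, go_terms):
    """Return {snp_position: [(gene, go_id, go_term), ...]} for overlapping genes.

    genes:    {gene_name: (start, end)} with 1-based inclusive coordinates
    go_terms: {gene_name: [(go_id, term), ...]}
    """
    annotations = {}
    for pos in snp_positions:
        hits = []
        for gene, (start, end) in genes.items():
            if start <= pos <= end:
                hits.extend((gene, go_id, term) for go_id, term in go_terms.get(gene, []))
        if hits:
            annotations[pos] = hits
    return annotations


def genes_in_window(genes, start, end):
    """Return the names of genes that overlap the 1-based inclusive window."""
    return sorted(g for g, (g_start, g_end) in genes.items() if g_start <= end and g_end >= start)


# Toy inputs: gene names and coordinates are hypothetical.
genes = {"geneA": (1080050, 1082000), "geneB": (1090000, 1095000)}
go_terms = {
    "geneA": [("GO:0005634", "Nucleus")],
    "geneB": [("GO:0004672", "Protamine kinase activity")],
}
for pos, hits in annotate_snps([1080105, 1085000, 1091000], genes, go_terms).items():
    print(pos, hits)
```

SNPs that fall outside every gene (1085000 here) are simply left out of the result.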

13 Data visualization
– Integrative Genomics Viewer (IGV)
– Broad Institute
– http://www.broadinstitute.org/igv/

14 SFS for population 1 and 2

15 Sliding window for summary statistics
Phist greater than 0.1 in window 1080001 - 1100000

GO Accession ID | Ontology | Specific
GO:0000124 | Cellular Component | Spt-Ada-Gcn5-acetyltransferase complex
GO:0005703 | Cellular Component | (Thought to be a site of active transcription)
GO:0005634 | Cellular Component | (Nucleus)
GO:0006911 | Biological Process | Phagosome biosynthesis/formation
GO:0045747 | Biological Process | Up regulation of Notch signaling pathway
GO:0006355 | Biological Process | Regulation of cellular transcription, DNA-dependent
GO:0000910 | Biological Process | (Cytoplasm division)
GO:0016773 | Molecular Function | (Intermolecular transfer of phosphorus group to an alcohol group)
GO:0005700 | Cellular Component | (Polytene associated)
GO:0005488 | Molecular Function | (Ligand, non-covalent partner)
GO:0005737 | Cellular Component | (Ambiguous)
GO:0035222 | Biological Process | (Patterning in wing imaginal disc)
GO:0005875 | Cellular Component | (Microtubule associated)
GO:0004672 | Molecular Function | Protamine kinase activity
GO:0000123 | Cellular Component | Histone acetylase complex

16 Identify differentiated genomic regions
– For each window with Fst > 0.1, print the name of the SNP and the associated GO term
– Phist (Fst) greater than 0.1 in window 1080001 - 1100000: the same GO accession IDs, ontologies, and terms listed on slide 15
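The thresholding step can be sketched as a filter over per-window Fst values. The Fst values below are made up for illustration; only the 0.1 cutoff and the window 1080001 - 1100000 come from the slide.

```python
def differentiated_windows(fst_by_window, threshold=0.1):
    """Return the windows whose Fst exceeds the threshold, in genomic order."""
    return sorted(
        window
        for window, fst in fst_by_window.items()
        if fst is not None and fst > threshold
    )


# Hypothetical per-window Fst values; None marks a window with no usable SNPs.
fst_by_window = {
    (1060001, 1080000): 0.02,
    (1080001, 1100000): 0.15,
    (1100001, 1120000): None,
}
print(differentiated_windows(fst_by_window))  # [(1080001, 1100000)]
```

Each returned window would then be passed to the GO module to list its SNPs and associated terms.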

17 Thank You Use PERL or die, print “ (X_x) ”; ##Hashes to Hashes## Print “ % 2 %”;

