ParSNP Hash Pipeline to parse SNP data and output summary statistics across sliding windows.

Objective
- Parse VCF files
- Calculate summary statistics across sliding windows throughout the genome
- Implement NTFreq module to calculate nucleotide frequencies for each population and combined population
- Implement TajimasD module to calculate Tajima’s D
- Implement GO module to annotate identified SNPs

Data set
- Simulated data set for chromosome 2R in Drosophila melanogaster
- 1.4 Mbp
  – 2 populations
- Pooled individuals per population
  – 75 bp reads, error rate 1%
  – 10,000 simulated SNPs
- 100x coverage per variant
- At least 100 bp apart
- Allelic frequencies ranging from 0.1 to 0.9 per population

Data to Variant Call Format
- Index reference genome
  – Only chromosome 2R of D. melanogaster
  – Genome build Dmel 3 from FlyBase
- Use BWA to align FastQ to reference genome
  – Gap open penalty = 1
  – Disallowing deletion within 12 bp of the 3’ end
  – Gap extension max = 12 (maximum level of gap extensions = 12)
- Use SAMTools to remove ambiguously mapped reads (keep MAPQ >= 20)
- Use BCFTools mpileup to generate a Binary Call Format (BCF) file
- BCF -> VCF
- FastQ -> sai -> SAM -> BAM -> .bcf -> VCF
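
The MAPQ filtering step can be illustrated without any external tools. The following is a minimal sketch, not the pipeline's own code (the slides use SAMTools for this step): it assumes uncompressed SAM text, where MAPQ is the fifth tab-separated column, and the function name `filter_sam_by_mapq` and the toy reads are hypothetical.

```python
def filter_sam_by_mapq(sam_lines, min_mapq=20):
    """Keep header lines and alignments whose MAPQ (SAM column 5) is >= min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines carry no MAPQ and are passed through unchanged.
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            kept.append(line)
    return kept


# Toy SAM records (made up for illustration only).
toy_sam = [
    "@SQ\tSN:2R\tLN:1400000",
    "read1\t0\t2R\t100\t37\t75M\t*\t0\t0\t*\t*",
    "read2\t0\t2R\t250\t3\t75M\t*\t0\t0\t*\t*",
]
kept = filter_sam_by_mapq(toy_sam)
print(len(kept))
```

Here `read2` has MAPQ 3 and is dropped, while the header and `read1` are kept.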

Formatting data: Parse VCF
For each window:
- Fetch the VCF rows from each BCF file
- Convert the VCF rows into hashes of arrays
- Compute the Theta, Pi, Tajima’s D for each population
- Compute Fst for each window between each population
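
As a rough illustration of turning VCF rows into a "hash of arrays", here is a minimal Python sketch in which a dict of lists plays the role of the Perl structure. It assumes plain-text VCF lines with the standard leading columns (CHROM, POS, ID, REF, ALT); the function name `parse_vcf_rows`, the chosen keys, and the toy records are assumptions, not the pipeline's actual layout.

```python
def parse_vcf_rows(lines, chrom, start, end):
    """Collect VCF rows inside [start, end] on chrom into a dict of lists."""
    columns = {"pos": [], "ref": [], "alt": [], "info": []}
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] == chrom and start <= pos <= end:
            columns["pos"].append(pos)
            columns["ref"].append(fields[3])
            columns["alt"].append(fields[4])
            columns["info"].append(fields[7])
    return columns


# Toy VCF content (made up for illustration only).
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "2R\t150\t.\tA\tG\t50\tPASS\tDP=100",
    "2R\t900\t.\tC\tT\t60\tPASS\tDP=100",
    "2R\t1500\t.\tG\tA\t55\tPASS\tDP=100",
]
window = parse_vcf_rows(toy_vcf, "2R", 1, 1000)
print(window["pos"])
```

Keeping one list per column means every statistic module can index the same window by position without re-reading the file.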

Sliding windows
- The sliding window size is user-specified, and each called module is calculated across windows of that size
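
One way to enumerate windows of a given size is sketched below. The slides only mention a window size; the `step` parameter (which defaults to non-overlapping windows) and the function name `sliding_windows` are assumptions added for illustration.

```python
def sliding_windows(chrom_length, window_size, step=None):
    """Yield 1-based inclusive (start, end) windows along a chromosome."""
    if step is None:
        step = window_size  # assumption: non-overlapping windows by default
    start = 1
    while start <= chrom_length:
        # The last window is truncated at the chromosome end.
        end = min(start + window_size - 1, chrom_length)
        yield (start, end)
        start += step


windows = list(sliding_windows(2500, 1000))
print(windows)
```

Passing a `step` smaller than `window_size` gives overlapping windows, at the cost of correlated statistics between neighbours.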

Module 1: Calculate allele frequencies
- Input is taken from parsed VCF file
- Hashes are created for each population with the following structure
  – {SNP_location}{nucleotide} -> frequency;
- Hashes created for full dataset
  – {SNP_location}{Population} -> {nucleotide} -> frequency
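
The two hash layouts can be mirrored with nested Python dicts. This is a sketch under stated assumptions: the input is taken to be per-site nucleotide read counts for each population (the slides do not say how counts are extracted from the VCF), and the names `nt_frequencies`, `full_dataset`, and the toy counts are hypothetical.

```python
NUCLEOTIDES = "ACGT"


def nt_frequencies(counts_by_site):
    """Return {SNP_location: {nucleotide: frequency}} from per-site counts."""
    freqs = {}
    for pos, counts in counts_by_site.items():
        total = sum(counts.values())
        if total == 0:
            continue  # no coverage at this site
        freqs[pos] = {nt: counts.get(nt, 0) / total for nt in NUCLEOTIDES}
    return freqs


def full_dataset(per_pop_counts):
    """Return {SNP_location: {population: {nucleotide: frequency}}}."""
    full = {}
    for pop, sites in per_pop_counts.items():
        for pos, freqs in nt_frequencies(sites).items():
            full.setdefault(pos, {})[pop] = freqs
    return full


# Toy read counts for one SNP in two populations (made up).
counts = {
    "pop1": {150: {"A": 70, "G": 30}},
    "pop2": {150: {"A": 20, "G": 80}},
}
pop1_freqs = nt_frequencies(counts["pop1"])
full = full_dataset(counts)
print(pop1_freqs[150]["A"], full[150]["pop2"]["G"])
```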

Output site frequency spectra
- Site frequency spectrum (SFS) output as the following hash:
  – {nonref_allele}{frequency} -> count;
- Allows us to calculate a histogram for the non-reference allele frequencies
- Send output to R to generate SFS graphs
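
The `{nonref_allele}{frequency} -> count` structure amounts to a histogram keyed by allele and binned frequency. The sketch below assumes frequencies are binned by rounding to one decimal place; that bin width, the function name `site_frequency_spectrum`, and the toy data are assumptions.

```python
from collections import defaultdict


def site_frequency_spectrum(freqs_by_site, ref_by_site, digits=1):
    """Count sites per (non-reference allele, rounded frequency) bin."""
    sfs = defaultdict(lambda: defaultdict(int))
    for pos, freqs in freqs_by_site.items():
        ref = ref_by_site[pos]
        for nt, freq in freqs.items():
            if nt != ref and freq > 0:
                sfs[nt][round(freq, digits)] += 1
    return sfs


# Toy per-site frequencies and reference alleles (made up).
freqs_by_site = {
    150: {"A": 0.70, "G": 0.30},
    900: {"C": 0.70, "T": 0.30},
    1200: {"A": 0.68, "G": 0.32},
}
ref_by_site = {150: "A", 900: "C", 1200: "A"}
sfs = site_frequency_spectrum(freqs_by_site, ref_by_site)

# Tab-separated rows like these could be written to a file and read into R.
for allele in sorted(sfs):
    for freq in sorted(sfs[allele]):
        print(f"{allele}\t{freq}\t{sfs[allele][freq]}")
```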

Module 2: Calculate Summary Statistics and Tajima’s D
- theta_pi (index of diversity)
- theta_watterson (index of diversity)
- Tajima’s D (index of selection/population expansion)
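
The formulas on these slides did not survive extraction, so the following is a sketch of the textbook definitions rather than the authors' implementation. It assumes biallelic sites and counts of the non-reference allele among n individually sampled chromosomes; the pooled data described earlier would likely need pool-specific corrections that are not shown. The function name `tajimas_d` is hypothetical.

```python
import math


def tajimas_d(n, alt_counts):
    """Return (theta_pi, theta_watterson, D) for n chromosomes.

    alt_counts holds the non-reference allele count at each site.
    D is None when it is undefined (no segregating sites or n < 4).
    """
    seg = [k for k in alt_counts if 0 < k < n]  # segregating sites only
    s = len(seg)
    if n < 2 or s == 0:
        return 0.0, 0.0, None
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    # theta_pi: expected pairwise differences summed over sites.
    theta_pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in seg)
    # theta_watterson: segregating sites scaled by the harmonic number.
    theta_w = s / a1
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * s + e2 * s * (s - 1)
    if variance <= 0:
        return theta_pi, theta_w, None
    return theta_pi, theta_w, (theta_pi - theta_w) / math.sqrt(variance)


# Ten singleton sites versus ten intermediate-frequency sites, n = 10.
_, _, d_rare = tajimas_d(10, [1] * 10)
_, _, d_common = tajimas_d(10, [5] * 10)
print(d_rare < 0, d_common > 0)
```

An excess of rare alleles pulls theta_pi below theta_watterson and makes D negative; an excess of intermediate-frequency alleles does the opposite.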

Module 3: Fst for DNA sequence
- Calculate Fst (index of differentiation) according to Hudson et al
  – Fst = 1 – Hw/Hb
  – Hw: average number of differences within each population
  – Hb: average number of differences between the 2 populations
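
A minimal sketch of this statistic from allele frequencies follows. It assumes the Hudson et al estimator takes the form Fst = 1 - Hw/Hb, that sites are biallelic, and that expected pairwise differences can be computed directly from per-population non-reference allele frequencies without a finite-sample correction. The function name `hudson_fst` is hypothetical.

```python
def hudson_fst(freqs_pop1, freqs_pop2):
    """Return 1 - Hw/Hb over matched sites, or None if Hb is zero."""
    hw = 0.0  # mean within-population differences, summed over sites
    hb = 0.0  # between-population differences, summed over sites
    for p1, p2 in zip(freqs_pop1, freqs_pop2):
        within1 = 2.0 * p1 * (1.0 - p1)
        within2 = 2.0 * p2 * (1.0 - p2)
        hw += (within1 + within2) / 2.0
        hb += p1 * (1.0 - p2) + p2 * (1.0 - p1)
    if hb == 0:
        return None  # no differences between populations; Fst undefined
    return 1.0 - hw / hb


print(hudson_fst([0.5], [0.5]))
print(hudson_fst([1.0], [0.0]))
```

Identical frequencies give an Fst of 0 and fixed differences give 1; summing Hw and Hb over all sites in a window before taking the ratio yields one value per window.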

Module 4: GO annotations
- Module takes SNP list as input
- Outputs the following:
  – List of genes that have overlap with SNP positions
  – Gene Ontology (GO) IDs and terms associated with each SNP-matched gene
  – List of genes for a selected window
- Visualization using GOSlim
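
The overlap step can be sketched as an interval lookup. Everything here is illustrative: the gene names and coordinates are placeholders, the gene-to-GO mapping is a toy table (the slides do not say which annotation source the module reads), and the function name `annotate_snps` is hypothetical.

```python
def annotate_snps(snp_positions, genes, go_terms):
    """Map each gene overlapping a SNP to its SNP positions and GO terms."""
    hits = {}
    for pos in snp_positions:
        for name, (start, end) in genes.items():
            if start <= pos <= end:
                hits.setdefault(name, []).append(pos)
    return {
        name: {"snps": snps, "go": go_terms.get(name, [])}
        for name, snps in hits.items()
    }


# Placeholder gene intervals and a toy gene-to-GO table.
genes = {"geneA": (100, 500), "geneB": (800, 1000)}
go_terms = {
    "geneA": [("GO:0005634", "nucleus")],
    "geneB": [("GO:0005737", "cytoplasm")],
}
annotated = annotate_snps([150, 600, 900], genes, go_terms)
for name in sorted(annotated):
    print(name, annotated[name]["snps"], annotated[name]["go"])
```

The nested loop is fine for a toy example; with thousands of SNPs and genes, sorting the intervals and using a binary search or an interval tree would be the usual choice.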

Data visualization
- Integrative Genomics Viewer (IGV)
- Broad Institute

SFS for populations 1 and 2

Sliding window for summary statistics
Phist greater than 0.1 in window

GO Accession ID | Ontology | Specific
GO: | Cellular Component | Spt-Ada-Gcn5-acetyltransferase complex
GO: | Cellular Component | (Thought to be a site of active transcription)
GO: | Cellular Component | (Nucleus)
GO: | Biological Process | Phagosome biosynthesis/formation
GO: | Biological Process | Up regulation of Notch signaling pathway
GO: | Biological Process | Regulation of cellular transcription, DNA-dependent
GO: | Biological Process | (Cytoplasm division)
GO: | Molecular Function | (Intermolecular transfer of phosphorus group to an alcohol group)
GO: | Cellular Component | (Polytene associated)
GO: | Molecular Function | (Ligand, non-covalent partner)
GO: | Cellular Component | (Ambiguous)
GO: | Biological Process | (Patterning in wing imaginal disc)
GO: | Cellular Component | (Microtubule associated)
GO: | Molecular Function | Protamine kinase activity
GO: | Cellular Component | Histone acetylase complex

Identify differentiated genomic regions
- For each window with an Fst > 0.1, print the name of the SNP and associated GO term

Phist (Fst) greater than 0.1 in window

GO Accession ID | Ontology | Specific
GO: | Cellular Component | Spt-Ada-Gcn5-acetyltransferase complex
GO: | Cellular Component | (Thought to be a site of active transcription)
GO: | Cellular Component | (Nucleus)
GO: | Biological Process | Phagosome biosynthesis/formation
GO: | Biological Process | Regulation of cellular transcription, DNA-dependent
GO: | Biological Process | (Cytoplasm division)
GO: | Molecular Function | (Intermolecular transfer of phosphorus group to an alcohol group)
GO: | Cellular Component | (Polytene associated)
GO: | Molecular Function | (Ligand, non-covalent partner)
GO: | Cellular Component | (Ambiguous)
GO: | Biological Process | (Patterning in wing imaginal disc)
GO: | Cellular Component | (Microtubule associated)
GO: | Molecular Function | Protamine kinase activity
GO: | Cellular Component | Histone acetylase complex
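
The reporting rule described on this slide (windows with Fst above 0.1, then SNP name and GO term) could look roughly like the sketch below. The input shapes, the function name `report_differentiated`, and the toy values are all assumptions; the real pipeline's output format is not shown in the slides.

```python
def report_differentiated(window_fst, snp_go, threshold=0.1):
    """Return one tab-separated line per SNP in windows with Fst > threshold."""
    lines = []
    for (start, end), fst in sorted(window_fst.items()):
        if fst is None or fst <= threshold:
            continue  # window is not differentiated (or Fst is undefined)
        for pos, (snp_name, go_term) in sorted(snp_go.items()):
            if start <= pos <= end:
                lines.append(f"{start}-{end}\t{fst:.3f}\t{snp_name}\t{go_term}")
    return lines


# Toy per-window Fst values and SNP annotations (made up).
window_fst = {(1, 1000): 0.25, (1001, 2000): 0.02}
snp_go = {150: ("snp_150", "nucleus"), 1500: ("snp_1500", "cytoplasm")}
report = report_differentiated(window_fst, snp_go)
for line in report:
    print(line)
```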

Thank You Use PERL or die, print “ (X_x) ”; ##Hashes to Hashes## Print “ % 2 %”;