Pathogen Informatics, 21st Nov 2014. Pathogen Sequencing Informatics. Jacqui Keane, Pathogen Informatics.



Pathogen Informatics Team
▸ Team of 9 software developers and bioinformaticians
Carla Cummins

Role of Pathogen Informatics
▸ Informatics support to the pathogen variation programme
  ▸ Dougan, Lawley, Parkhill, Berriman, Thomson & Kellam faculty teams
  ▸ Researchers, visiting workers, collaborators
  ▸ Approx. 120 people
▸ Applications and systems to support research activities
▸ Automated pipelines for sequence tracking and analysis
▸ Ad-hoc bioinformatics support and training

Cumulative Number of Tbp Sequenced (chart)

Cumulative Number of Samples Sequenced (chart; data labels: 107K, 85K)

Sequence Analysis Pipelines
▸ Assembly
▸ Annotation
▸ Mapping
▸ Variant Calling
▸ RNA-Seq Expression
▸ QC
▸ Sequence Tracking

Pathogen Tracking and Import Pipeline
▸ Cron regularly checks iRODS for new sequencing data
▸ Populate pathogen tracking database with metadata
  ▸ iRODS, warehouse
▸ Only update lanes where NPG QC is complete
▸ Convert BAMs to FASTQ and store on disk
▸ Convert bax.h5 files to FASTQ and store on disk
Diagram: Sequencescape (sequencing informatics), Warehouse (sequencing informatics), iRODS (sequencing informatics), cron, Pathogen Tracking, Pathogen Disk
▸ Sequencescape: register study, request sequencing, change metadata
▸ Changes made in warehouse/iRODS (~24 hours)
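The rule "only update lanes where NPG QC is complete" can be illustrated with a small filter. This is a minimal sketch, not the pipeline's actual code: the `Lane` record, its field names and the `imported` flag are hypothetical stand-ins for whatever the tracking database stores.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    """Hypothetical lane record; field names are illustrative only."""
    name: str
    npg_qc_complete: bool
    imported: bool = False


def lanes_to_import(lanes):
    """Select lanes whose NPG QC is complete and that are not yet imported."""
    return [lane for lane in lanes if lane.npg_qc_complete and not lane.imported]


lanes = [
    Lane("1234_5", npg_qc_complete=True),
    Lane("1234_6", npg_qc_complete=False),                # QC still pending: skipped
    Lane("1234_7", npg_qc_complete=True, imported=True),  # already on disk: skipped
]
selected = [lane.name for lane in lanes_to_import(lanes)]
print(selected)
```

A periodic job (cron in the slide) would run a check like this on each pass, so lanes with pending QC are simply picked up on a later run.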

Finding Data
Script: pathfind
Examples:
▸ Where is the FASTQ for a lane: pathfind -t lane -id 1234_5
▸ Make a symlink to FASTQ: pathfind -t lane -id 1234_5 -symlink
▸ Find all FASTQs for a species: pathfind -t species -i Staph
▸ Output lane stats to a .csv file: pathfind -t species -i Staph -results out.csv
▸ Get all the options: pathfind -h
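The find scripts in this talk (pathfind, qcfind, assemblyfind, annotationfind, mapfind, snpfind, rnaseqfind) share the same `-t <type> -i <id>` shape. The sketch below only builds and prints such command lines; it does not run them. `build_find_command` is a hypothetical helper, and the only flags used are ones that appear on the slides.

```python
import shlex


def build_find_command(script, search_type, search_id, extra=()):
    """Build an argv list of the form: <script> -t <type> -i <id> [extra...]."""
    return [script, "-t", search_type, "-i", search_id, *extra]


lane_cmd = build_find_command("pathfind", "lane", "1234_5")
species_cmd = build_find_command(
    "assemblyfind", "species", "Leishmania donovani", extra=["-stats"]
)

# shlex.join quotes arguments that contain spaces, so the printed
# line is safe to paste into a shell.
print(shlex.join(lane_cmd))
print(shlex.join(species_cmd))
```

Keeping the arguments as a list until the last moment avoids quoting mistakes with identifiers such as species names that contain spaces.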


QC Pipeline
▸ Align 100MB to reference with bwa
▸ Generate QC stats
  ▸ Basic statistics on FASTQ, e.g. yield
  ▸ Percent reads/bases mapped
  ▸ Percent genome covered
  ▸ Error rate
▸ Create QC plots
  ▸ GC plot vs. reference GC
  ▸ Insert size distribution
  ▸ Base quality
  ▸ Coverage
▸ Run Kraken
  ▸ Assigns taxonomic labels to short DNA sequences
▸ Results presented through QCGrind web interface
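Several of the QC statistics listed above are simple ratios. The following is a minimal sketch of how they could be computed from toy inputs; the function names and input shapes (a sequence string, a per-base depth list, read counts) are assumptions for illustration and not the pipeline's own code.

```python
def gc_percent(seq):
    """Percent G+C among unambiguous bases (A, C, G, T); Ns are ignored."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt


def coverage_stats(depths):
    """Percent of positions with depth > 0, and mean depth, from per-base depths."""
    n = len(depths)
    covered = sum(1 for d in depths if d > 0)
    return {"percent_covered": 100.0 * covered / n, "mean_depth": sum(depths) / n}


def percent_mapped(mapped, total):
    """Percent of reads (or bases) that mapped to the reference."""
    return 100.0 * mapped / total


print(gc_percent("ACGTNN"))
print(coverage_stats([0, 2, 4, 2]))
print(percent_mapped(90, 120))
```

Comparing `gc_percent` of the reads with that of the reference is the idea behind the "GC plot vs. reference GC" check: a large gap can hint at contamination or a wrong reference.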

QCGrind

Kraken Results
Script: qcfind
Examples:
▸ Where is the Kraken report for a lane: qcfind -t lane -i 1234_5
▸ Where are the Kraken reports for a study: qcfind -t study -i 3249


Assembly Pipeline
▸ Bacteria samples are assembled automatically
▸ Virus samples are assembled automatically on a study-by-study basis
▸ Eukaryote samples are assembled on a per-request basis
▸ PacBio samples are assembled automatically using HGAP


Assembly: get results
Script: assemblyfind
Examples:
▸ Create symlinks to all the final assemblies in the given study: assemblyfind -t study -id "My study" -symlink
▸ Find an assembly for a given lane: assemblyfind -t lane -id 1234_5#6
▸ Make a .csv file of assembly stats for a given species: assemblyfind -t species -i "Leishmania donovani" -stats
▸ Get all the options: assemblyfind -h


Annotation Pipeline
▸ Run automatically on all bacteria de novo assemblies (also works for viruses)
▸ Can be run in standalone mode: annotate_bacteria
▸ Annotation ready for submission to EMBL/GenBank
▸ Pipeline steps
  ▸ Genes predicted with Prodigal
  ▸ RNA predicted with Infernal
  ▸ The databases are searched in the following order:
    ▸ Genus-specific RefSeq databases
    ▸ UniProtKB bacteria/virus databases
    ▸ Conserved Domain Database
    ▸ Pfam (A)
    ▸ Rfam
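An ordered database search like the one listed above can be sketched as a lookup that stops at the first database returning a hit. This is an illustration under assumptions: the slide gives only the order, so "first hit wins" is an interpretation, and the dictionary-based databases and their keys are hypothetical stand-ins for real sequence searches.

```python
# Search order as listed on the slide; the key names are illustrative labels.
SEARCH_ORDER = ["genus_refseq", "uniprotkb", "cdd", "pfam_a", "rfam"]


def annotate(feature_id, databases):
    """Return (database_name, annotation) for the first database with a hit.

    `databases` maps a database name to a {feature_id: annotation} dict.
    Returns (None, None) when no database has a hit.
    """
    for name in SEARCH_ORDER:
        hit = databases.get(name, {}).get(feature_id)
        if hit is not None:
            return name, hit
    return None, None


toy_databases = {
    "uniprotkb": {"gene_1": "DNA gyrase subunit A"},
    "pfam_a": {"gene_1": "domain X", "gene_2": "domain Y"},
}
print(annotate("gene_1", toy_databases))
print(annotate("gene_2", toy_databases))
print(annotate("gene_3", toy_databases))
```

Searching the most specific database first means a genus-level match is preferred over a generic domain hit when both exist.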

Annotation: get results
Script: annotationfind
Examples:
▸ To get annotation for all samples in study 123: annotationfind -t study -id 123
▸ Find annotation for a given lane: annotationfind -t lane -id 1234_5#6
▸ Create a multi-FASTA file of all of the gyrA genes for Staph: annotationfind -t species -i "Staph" -g gyrA
▸ Get all the options: annotationfind -h


Mapping Pipeline
Diagram: FASTQ → split → map → merge → mark duplicates → map stats → BAM (view in Artemis)
▸ Mappers: bwa, smalt, stampy, bowtie2, tophat
▸ Virus and bacteria: smalt index depends on read length: -k 13 -s 4; bp => -k 13 -s 6; >100bp => -k 20 -s 13
▸ Eukaryotes: smalt index -k 13 -s 2
▸ smalt map -f samsoft -i 3*insert || 1500, if eukaryote: -x -y 0.8 -r 0
▸ Mark duplicates (picard): java -jar MarkDuplicates.jar INPUT=bam OUTPUT=bam
▸ Map stats: reads mapped, reads paired, bases mapped, mean insert size, genome coverage, coverage depth
▸ Other tools and outputs shown: samtools, metadata (xls)
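The `smalt map` settings on this slide can be read as a small decision rule. The sketch below assumes that "-i 3*insert || 1500" means "use three times the library insert size, or 1500 when the insert size is unknown"; that reading, and the helper name `smalt_map_options`, are assumptions. The flags themselves are copied from the slide.

```python
def smalt_map_options(insert_size=None, eukaryote=False):
    """Build the option list for `smalt map` as described on the slide.

    insert_size: expected library insert size in bp, or None if unknown.
    eukaryote:   add the extra eukaryote-only options when True.
    """
    # Assumed reading of "3*insert || 1500": fall back to 1500 if unknown.
    max_insert = 3 * insert_size if insert_size else 1500
    options = ["-f", "samsoft", "-i", str(max_insert)]
    if eukaryote:
        options += ["-x", "-y", "0.8", "-r", "0"]
    return options


print(smalt_map_options(insert_size=300))
print(smalt_map_options(eukaryote=True))
```

The index parameters (`-k`, `-s`) are left out on purpose, because the read-length thresholds for virus and bacteria are incomplete in the transcript.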

Mapping: get results
Script: mapfind
Examples:
▸ Where is the BAM for a lane: mapfind -t lane -i 1234_5
▸ Make a symlink to BAM (and its index file): mapfind -t lane -id 1234_5 -symlink
▸ Find all BAMs for a species: mapfind -t species -i Staph
▸ Output mapping stats to a .csv file: mapfind -t species -i Staph -results out.csv
▸ Get all the options: mapfind -h


Variant Calling Pipeline
Diagram elements: BAM, mpileup, filter, VCF, pseudo-genome, stats
▸ samtools mpileup -d DSugBf ref bam | bcftools view -cg
▸ Filters: depth < 4, depth_strand < 2, ratio < 0.75, quality < 50, map_quality < 30, af1 < 0.95, strand_bias < 0.001, map_bias < 0.001, tail_bias <
▸ samtools mpileup -d DSug
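The filter thresholds above can be applied as a simple per-variant check. This is a minimal sketch assuming that a call fails a filter when its value is below the listed threshold; the dictionary-based variant record is hypothetical, and `tail_bias` is omitted because its threshold is cut off in the transcript.

```python
# Thresholds copied from the slide; a value below the threshold fails (assumed).
THRESHOLDS = {
    "depth": 4,
    "depth_strand": 2,
    "ratio": 0.75,
    "quality": 50,
    "map_quality": 30,
    "af1": 0.95,
    "strand_bias": 0.001,
    "map_bias": 0.001,
}


def failed_filters(variant):
    """Return the names of the filters this variant fails, in slide order."""
    return [name for name, minimum in THRESHOLDS.items() if variant[name] < minimum]


good_call = {
    "depth": 40, "depth_strand": 18, "ratio": 0.98, "quality": 200,
    "map_quality": 60, "af1": 1.0, "strand_bias": 0.5, "map_bias": 0.7,
}
# Same call, but with low depth and low quality.
weak_call = dict(good_call, depth=3, quality=40)

print(failed_filters(good_call))
print(failed_filters(weak_call))
```

Returning the list of failed filter names, rather than a single pass/fail flag, makes it easy to see why a position was excluded.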

Variant Calling: get results
Script: snpfind
Examples:
▸ Find VCF file for a lane: snpfind -t lane -i 1234_5
▸ Make symlink to VCF file (and its index) for a lane: snpfind -t lane -i 1234_5 -symlink
▸ Get a single file with a multi-FASTA alignment of pseudogenomes from a file of lanes: snpfind -t file -i filename -p
▸ Read the usage: snpfind -h


RNASeq Expression: get results
Script: rnaseqfind
Examples:
▸ All directories with RNASeq results for study 1234: rnaseqfind -t study -i 1234
▸ All spreadsheets for study 1234: rnaseqfind -t study -i 1234 -f spreadsheet
▸ Coverage plots: rnaseqfind -t study -i 1234 -f coverage
▸ Standalone script: rna_seq_expression -h

Pathogen Informatics Training
▸ New starters induction
  ▸ Getting started, basic UNIX, compute and storage
  ▸ Sequencing pipelines
▸ Support services provided by Pathogen Informatics
  ▸ Queries about location of sequencing data
  ▸ External software applications
  ▸ Small-scale bespoke analysis
  ▸ Queries about how to use pcs/farm
  ▸ Arranging an induction if/when you join the pathogen team