Pathogen Sequencing Informatics
Jacqui Keane, Pathogen Informatics, 21st Nov 2014


1 Pathogen Sequencing Informatics. Jacqui Keane, Pathogen Informatics. 21st Nov 2014

2 Pathogen Informatics Team
▸ Team of 9 software developers and bioinformaticians
Carla Cummins

3 Role of Pathogen Informatics
▸ Informatics support to pathogen variation programme
  ▸ Dougan, Lawley, Parkhill, Berriman, Thomson & Kellam faculty teams
  ▸ Researchers, visiting workers, collaborators
  ▸ Approx. 120 people
▸ Applications and systems to support research activities
▸ Automated pipelines for sequence tracking and analysis
▸ Ad-hoc bioinformatics support and training

4 Cumulative Number of Tbp Sequenced (chart; values shown: 79.1, 57.3)

5 Cumulative Number of Samples Sequenced (chart; values shown: 107K, 85K)

6 Sequence Analysis Pipelines: Assembly, Annotation, Mapping, Variant Calling, RNA-Seq Expression, QC, Sequence Tracking

7 Pathogen Tracking and Import Pipeline
▸ Cron regularly checks iRODS for new sequencing data
▸ Populate pathogen tracking database with metadata
  ▸ iRODS, warehouse
▸ Only update lanes where NPG QC is complete
▸ Convert BAMs to fastq and store on disk
▸ Convert bax.h5 files to fastq and store on disk
Diagram: Sequencescape (sequencing informatics): register study, request sequencing, change metadata. Changes are made in warehouse/iRODS (~24 hours). Warehouse and iRODS (sequencing informatics) feed, via cron, Pathogen Tracking and Pathogen Disk.
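
The import rule on this slide can be sketched roughly as follows. This is a minimal illustration only: the `Lane` record, its `npg_qc_complete` flag and the `already_tracked` set are hypothetical stand-ins, not the pipeline's actual schema or code.

```python
from dataclasses import dataclass


@dataclass
class Lane:
    name: str               # lane identifier, e.g. "1234_5"
    npg_qc_complete: bool   # hypothetical flag standing in for NPG QC status
    file_format: str        # "bam" (Illumina) or "bax" (PacBio)


def lanes_to_import(lanes, already_tracked):
    """Return lanes that are new to tracking and have completed NPG QC."""
    return [
        lane for lane in lanes
        if lane.npg_qc_complete and lane.name not in already_tracked
    ]


def conversion_step(lane):
    """Name the conversion the slide lists for each input format."""
    if lane.file_format in ("bam", "bax"):
        return f"{lane.file_format} -> fastq"
    raise ValueError(f"unexpected format: {lane.file_format}")
```

A cron job would call something like `lanes_to_import` on each run, so lanes whose QC is still pending are simply picked up on a later pass.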

8 Finding Data
Script: pathfind
Examples:
▸ Where is the FASTQ for a lane:
    pathfind -t lane -id 1234_5
▸ Make a symlink to FASTQ:
    pathfind -t lane -id 1234_5 -symlink
▸ Find all FASTQs for a species:
    pathfind -t species -i Staph
▸ Output lane stats to a .csv file:
    pathfind -t species -i Staph -results out.csv
▸ Get all the options:
    pathfind -h
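
When these lookups are scripted rather than typed, the argument list can be built programmatically. The sketch below only assembles the command and uses just the `-t` and `-i` options shown on the slide; `find_command` and `as_shell` are hypothetical helper names, and nothing here runs the script.

```python
import shlex


def find_command(script, search_type, search_id, extra=()):
    """Build an argument list such as ['pathfind', '-t', 'lane', '-i', '1234_5']."""
    return [script, "-t", search_type, "-i", search_id, *extra]


def as_shell(argv):
    """Render the argument list as a shell-safe string for logging or copy-paste."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

Passing the list form to `subprocess.run` avoids shell quoting problems with identifiers that contain spaces or `#`.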

9 Sequence Analysis Pipelines: Assembly, Annotation, Mapping, Variant Calling, RNA-Seq Expression, Sequence Tracking, QC

10 QC Pipeline
▸ Align 100MB to reference with bwa
▸ Generate QC stats
  ▸ Basic statistics on fastq e.g. yield
  ▸ Percent reads/bases mapped
  ▸ Percent genome covered
  ▸ Error rate
▸ Create QC Plots
  ▸ GC plot vs. reference GC
  ▸ Insert size distribution
  ▸ Base quality
  ▸ Coverage
▸ Run Kraken
  ▸ Assigns taxonomic labels to short DNA sequences
▸ Results presented through QCGrind web interface
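
To make the QC statistics concrete, here is a small sketch of how a few of them could be computed from already-parsed inputs. The function names and input shapes (read counts, a per-base depth list, a sequence string) are assumptions for illustration; the pipeline itself derives these from the bwa alignment.

```python
def percent(part, whole):
    """Percentage of part in whole, or 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def qc_stats(reads_total, reads_mapped, depth_per_base):
    """Summarise mapping QC from read counts and per-base coverage depth."""
    covered = sum(1 for depth in depth_per_base if depth > 0)
    n_bases = len(depth_per_base)
    return {
        "percent_reads_mapped": percent(reads_mapped, reads_total),
        "percent_genome_covered": percent(covered, n_bases),
        "mean_depth": sum(depth_per_base) / n_bases if n_bases else 0.0,
    }


def gc_fraction(sequence):
    """Fraction of G/C among unambiguous bases, for comparison with reference GC."""
    sequence = sequence.upper()
    acgt = sum(sequence.count(base) for base in "ACGT")
    return (sequence.count("G") + sequence.count("C")) / acgt if acgt else 0.0
```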

11 QCGrind

12 Kraken Results
Script: qcfind
Examples:
▸ Where is the kraken report for a lane:
    qcfind -t lane -i 1234_5
▸ Where are the kraken reports for a study:
    qcfind -t study -i 3249

13 Sequence Analysis Pipelines: QC, Annotation, Mapping, Variant Calling, RNA-Seq Expression, Sequence Tracking, Assembly

14 Assembly Pipeline
▸ Bacteria samples are assembled automatically
▸ Virus samples are assembled automatically on a study-by-study basis
▸ Eukaryote samples are assembled on a per-request basis
▸ PacBio samples are assembled automatically using HGAP

15 Assembly Pipeline

16 Assembly Pipeline

17 Assembly: get results
Script: assemblyfind
Examples:
▸ Create symlinks to all the final assemblies in the given study:
    assemblyfind -t study -id "My study" -symlink
▸ Find an assembly for a given lane:
    assemblyfind -t lane -id 1234_5#6
▸ Make a .csv file of assembly stats for a given species:
    assemblyfind -t species -i "Leishmania donovani" -stats
▸ Get all the options:
    assemblyfind -h

18 Sequence Analysis Pipelines: QC, Assembly, Mapping, Variant Calling, RNA-Seq Expression, Sequence Tracking, Annotation

19 Annotation Pipeline
▸ Run automatically on all Bacteria de novo assemblies (also works for Viruses)
▸ Can be run in standalone mode: annotate_bacteria
▸ Annotation ready for submission to EMBL/GenBank
▸ Pipeline Steps
  ▸ Genes predicted with Prodigal
  ▸ RNA predicted with Infernal
  ▸ The databases are searched in the following order:
    ▸ Genus-specific RefSeq databases
    ▸ UniProtKB bacteria/virus databases
    ▸ Conserved Domain Database
    ▸ Pfam (A)
    ▸ Rfam
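
The ordered database search can be pictured as a first-hit-wins lookup. In this sketch each database is a plain dictionary standing in for a real sequence search, and the `"hypothetical protein"` fallback is an assumption added for illustration; the slide does not say what happens when nothing matches.

```python
# Search order as listed on the slide; the dictionary keys are illustrative labels.
SEARCH_ORDER = [
    "genus RefSeq",
    "UniProtKB bacteria/virus",
    "Conserved Domain Database",
    "Pfam-A",
    "Rfam",
]


def annotate(gene_id, databases, order=SEARCH_ORDER):
    """Return (database_name, product) from the first database with a hit."""
    for name in order:
        hit = databases.get(name, {}).get(gene_id)
        if hit is not None:
            return name, hit
    # Assumed fallback label when no database has a match.
    return None, "hypothetical protein"
```

Searching the most specific database first means a curated genus-level name takes precedence over a more generic domain-level match.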

20 Annotation: get results
Script: annotationfind
Examples:
▸ To get annotation for all samples in study 123:
    annotationfind -t study -id 123
▸ Find annotation for a given lane:
    annotationfind -t lane -id 1234_5#6
▸ Create a multi-FASTA file of all of the gyrA genes for Staph:
    annotationfind -t species -i "Staph" -g gyrA
▸ Get all the options:
    annotationfind -h

21 Sequence Analysis Pipelines: QC, Assembly, Variant Calling, Sequence Tracking, Mapping, Annotation, RNA-Seq Expression

22 Mapping Pipeline
Diagram: Fastq → split → map → merge → mark duplicates → map stats → BAM (view in Artemis); metadata (xls)
▸ Mappers: bwa, smalt, stampy, bowtie2, tophat
▸ Virus and bacteria: smalt index depends on read length: -k 13 -s 4; 70-100bp => -k 13 -s 6; >100bp => -k 20 -s 13
▸ Eukaryotes: smalt index -k 13 -s 2
▸ smalt map -f samsoft -i 3*insert || 1500, if eukaryote: -x -y 0.8 -r 0
▸ Mark duplicates (picard): java -jar MarkDuplicates.jar INPUT=bam OUTPUT=bam
▸ Map stats (samtools): reads mapped, reads paired, bases mapped, mean insert size, genome coverage, coverage depth
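
The read-length rule for `smalt index` can be written as a small lookup. This is a sketch under two stated assumptions: the first setting (`-k 13 -s 4`) is taken to apply to reads shorter than 70bp, since the slide does not give its range, and `3*insert || 1500` is read as "three times the insert size, or 1500 when the insert size is unknown". The function names are hypothetical.

```python
def smalt_index_options(read_length, eukaryote=False):
    """Pick smalt index word length (k) and step size (s) as listed on the slide."""
    if eukaryote:
        return {"k": 13, "s": 2}
    if read_length > 100:
        return {"k": 20, "s": 13}
    if read_length >= 70:
        return {"k": 13, "s": 6}
    # Assumption: the unlabelled first setting covers reads shorter than 70bp.
    return {"k": 13, "s": 4}


def max_insert_size(insert_size=None):
    """Value for smalt map -i: 3 * insert size, or 1500 if it is not known."""
    return 3 * insert_size if insert_size else 1500
```

For example, a 150bp bacterial lane with a 300bp insert would be indexed with `-k 20 -s 13` and mapped with `-i 900`.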

23 Mapping: get results
Script: mapfind
Examples:
▸ Where is the BAM for a lane:
    mapfind -t lane -i 1234_5
▸ Make a symlink to BAM (and its index file):
    mapfind -t lane -id 1234_5 -symlink
▸ Find all BAMs for a species:
    mapfind -t species -i Staph
▸ Output mapping stats to a .csv file:
    mapfind -t species -i Staph -results out.csv
▸ Get all the options:
    mapfind -h

24 Sequence Analysis Pipelines: QC, Assembly, Mapping, RNA-Seq Expression, Sequence Tracking, Variant Calling, Annotation

25 Variant Calling Pipeline
Diagram stages: mpileup, filter, pseudo-genome, stats. Input: BAM. Outputs: VCF, pseudo-genome.
▸ mpileup: samtools mpileup -d 1000 -DSugBf ref bam | bcftools view -cg
▸ filter: depth < 4, depth_strand < 2, ratio < 0.75, quality < 50, map_quality < 30, af1 < 0.95, strand_bias < 0.001, map_bias < 0.001, tail_bias < 0.001
▸ samtools mpileup -d 1000 -DSug
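
The filter thresholds can be expressed as a table of minimum values. The sketch assumes the slide lists rejection conditions, so a site fails when any listed value falls below its threshold; the dictionary-based `site` record and the function names are hypothetical, and values missing from a site are simply not tested.

```python
# Minimum acceptable values, taken from the thresholds on the slide.
MIN_THRESHOLDS = {
    "depth": 4,
    "depth_strand": 2,
    "ratio": 0.75,
    "quality": 50,
    "map_quality": 30,
    "af1": 0.95,
    "strand_bias": 0.001,
    "map_bias": 0.001,
    "tail_bias": 0.001,
}


def failed_filters(site):
    """List the names of every threshold the site falls below."""
    return [
        name for name, minimum in MIN_THRESHOLDS.items()
        if name in site and site[name] < minimum
    ]


def passes_filters(site):
    """A site is kept only if it fails none of the filters."""
    return not failed_filters(site)
```

Returning the names of the failed filters, rather than a bare boolean, makes it easier to report why a position was excluded.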

26 Variant Calling: get results
Script: snpfind
Examples:
▸ Find vcf file for a lane:
    snpfind -t lane -i 1234_5
▸ Make symlink to vcf file (and its index) for a lane:
    snpfind -t lane -i 1234_5 -symlink
▸ Get single file with multi-FASTA alignment of pseudogenomes from a file of lanes:
    snpfind -t file -i filename -p
▸ Read the usage:
    snpfind -h

27 Sequence Analysis Pipelines: QC, Assembly, Annotation, Mapping, Variant Calling, Sequence Tracking, RNA-Seq Expression

28 Pathogen Informatics 21 st Nov 2014

29 RNASeq Expression: get results
Script: rnaseqfind
Examples:
▸ All directories with RNASeq results for study 1234:
    rnaseqfind -t study -i 1234
▸ All spreadsheets for study 1234:
    rnaseqfind -t study -i 1234 -f spreadsheet
▸ Coverage plots:
    rnaseqfind -t study -i 1234 -f coverage
▸ Standalone script:
    rna_seq_expression -h

30 Pathogen Informatics Training
▸ New starters induction
  ▸ Getting started, basic UNIX, compute and storage
  ▸ Sequencing pipelines
▸ Support services provided by Pathogen Informatics
  ▸ Queries about location of sequencing data
  ▸ External software applications
  ▸ Small-scale bespoke analysis
  ▸ Queries about how to use pcs/farm
▸ To arrange an induction if/when you join the pathogen team
  ▸ Email path-help@sanger.ac.uk

