Hubert DENISE


1 Hubert DENISE

2 About me
1997: PhD, Molecular Parasitology, Univ. Bordeaux II, France
PostDoc, WCMP, Univ. Glasgow, UK
2003 – 2005: Lecturer, Molecular Biology, Univ. Clermont-Ferrand II, France
Sr. Scientist, Pfizer Ltd, Sandwich, UK
2011 – 2012: MSc Bioinformatics, Univ. Cranfield, UK
2012: Bioinformatician, Sanger Institute then EBI, Hinxton, UK

3 Where is the true cost of NGS? [Chart labels: 70 % (~80 bp/$), 14.5 %, 28 % (~2m bp/$), 36.5 %, 14.5 %, 55 %, 30 %, 4.5 %] Sboner et al. Genome Biology (2011) 12:125

4 EBI Metagenomics pipeline: data analysis using selected EBI and external software tools. Outline: • Philosophy • Submission to EBI Metagenomics • QC steps • Overview of functional analysis • Overview of taxonomy analysis • Metagenome assembly • Result outputs • Other public pipelines

5 Philosophy behind EBI Metagenomics pipeline
From chaos to structure:
• archiving of data with metadata
• performing stringent QC filtering prior to analysis
• quality in, quality out
• performing robust taxonomy and functional analysis
• model-based rather than similarity-based approaches
• assignment done on reads rather than assembly
• intuitive navigation through website
• constant drive to improvement
• benchmarking and tool testing
Helping metagenomics researchers make sense of their data

6 EBI Metagenomics pipeline: data analysis using selected EBI and external software tools. Outline: • Philosophy • Submission to EBI Metagenomics • QC steps • Overview of functional analysis • Overview of taxonomy analysis • Metagenome assembly • Result outputs • Other public pipelines

7 [Website screenshot; callouts: secure login, navigation panes, resource stats, latest data and news]

8 Submitting to EBI Metagenomics
Your data is valuable to you:
• Raw sequence data
• Description of sample and experiment (sample metadata)
• Analysis steps and results
All of this needs to be captured and stored to give context to your data. If so, your data can also be valuable to others.

9 Submitting to EBI Metagenomics
EBI Metagenomics wants to encourage people to supply as much detailed metadata as possible (where, when, what, how, who), but with the lowest possible overhead:
• Development of intuitive web-based tools: ENA Webin and ISA tools
• Use of templates and check-lists (MIGS/MIXS standards)
• Tutorial and direct support

10 EBI Metagenomics pipeline: data analysis using selected EBI and external software tools. Outline: • Philosophy • Submission to EBI Metagenomics • QC steps • Overview of functional analysis • Overview of taxonomy analysis • Metagenome assembly • Result outputs • Other public pipelines

11 Metagenomics data analysis: Quality control; Diversity analysis; Functional analysis. Image credits: (1) Christina Toft & Siv G. E. Andersson; (2) Dalebroux Z D et al. Microbiol. Mol. Biol. Rev. 2010;74:

12 Overview of EBI Metagenomics Pipeline
raw reads → trim and QC, remove short, remove duplicates (discarded reads) → processed reads → rRNASelector
reads with rRNA → Qiime → Taxonomic analysis (Amplicon-based data also enters at Qiime)
reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment or Unknown function

13 EBI Metagenomics: QC rationale
Why?
• Garbage in, garbage out
• Base call error:
  - each base call has an associated quality score
  - specific platform-dependent errors
• Read quality decreases with read length
• NGS generates duplicate reads (false and real). Reducing duplication reduces analysis time and prevents analysis bias.
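The quality score attached to each base call is conventionally a Phred score Q, related to the estimated error probability p by Q = -10 log10(p). The sketch below only illustrates that relationship. It assumes FASTQ quality strings in the common Phred+33 ASCII encoding, which the slide does not specify, and the function names are hypothetical.

```python
def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q to an error probability p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string: str) -> list[int]:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(ch) - 33 for ch in quality_string]


# Illustrative quality string, not taken from real data.
scores = decode_phred33("II5+!")
print(scores)  # [40, 40, 20, 10, 0]
for q in scores:
    print(q, phred_to_error_prob(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base is wrong, which is why low-quality read ends are worth trimming before analysis.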

14 EBI Metagenomics: QC step by step
• Clipping: low quality ends trimmed and adapter sequences removed using Biopython SeqIO package
• Quality filtering: sequences with > 10% undetermined nucleotides removed
• Read length filtering: short sequences are removed
• Duplicate sequences removal: clustered on 99% identity (UCLUST v ) and representative sequence chosen
• Repeat masking: RepeatMasker (open-3.2.2); reads with 50% or more nucleotides masked are removed
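Three of the filters above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the pipeline's code: the 10% threshold for undetermined nucleotides comes from the slide, `min_length` is a hypothetical cut-off (the slide gives none), and duplicates are removed by exact match as a simple stand-in for the 99%-identity UCLUST clustering.

```python
def too_many_ns(seq: str, max_n_fraction: float = 0.10) -> bool:
    """True if more than max_n_fraction of the bases are undetermined (N)."""
    return len(seq) > 0 and seq.upper().count("N") / len(seq) > max_n_fraction


def qc_filter(reads, min_length=100, max_n_fraction=0.10):
    """Filter (read_id, sequence) pairs by N content, length and duplication.

    min_length is a hypothetical value. Duplicates are detected by exact
    sequence match, which is simpler than identity-based clustering.
    """
    seen = set()
    kept = []
    for read_id, seq in reads:
        seq = seq.upper()
        if too_many_ns(seq, max_n_fraction):
            continue  # quality filtering
        if len(seq) < min_length:
            continue  # read length filtering
        if seq in seen:
            continue  # duplicate removal (first occurrence is kept)
        seen.add(seq)
        kept.append((read_id, seq))
    return kept


# Toy reads with hypothetical identifiers; min_length lowered to suit them.
reads = [
    ("r1", "ACGTACGTAC"),
    ("r2", "ACGTACGTAC"),  # exact duplicate of r1
    ("r3", "ACGNNACGTA"),  # 20% N
    ("r4", "ACGT"),        # too short
    ("r5", "TTGACCGTAA"),
]
kept = qc_filter(reads, min_length=8)
print([read_id for read_id, _ in kept])  # ['r1', 'r5']
```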

15 EBI Metagenomics: QC consequences [Figure panels: Roche 454, Illumina, Ion Torrent]

16 EBI Metagenomics: overview of functional analysis
reads without rRNA → FragGeneScan → predicted CDS (pCDS) → InterProScan → Function assignment or Unknown function

17 EBI Metagenomics: identification of coding sequences
Prediction of coding sequences is a challenge:
• read length
• sequencing errors: frame-shift
Two main types of approaches:
• homology-based methods: identify only known coding sequences
• feature-based approaches: predict probability that ORFs are coding
EBI Metagenomics uses FragGeneScan:
• hidden Markov models to correct frame-shift using codon usage
• probabilistic identification of start and stop codons
• 60 bp minimum ORF
Rho et al. (2010) NAR 38-20
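For contrast with the model-based approach, the sketch below enumerates stop-free stretches in all six reading frames and keeps those of at least 60 bp. It is not FragGeneScan's method (no hidden Markov model, no codon usage, no frame-shift correction). The function names and the coordinate convention are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def open_frames(seq: str, min_length: int = 60):
    """Yield (strand, frame, start, end) for stop-free stretches >= min_length bp.

    Coordinates are 0-based, end-exclusive, relative to the scanned strand.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = frame
            for pos in range(frame, len(s) - 2, 3):
                if s[pos:pos + 3] in STOP_CODONS:
                    if pos - start >= min_length:
                        yield strand, frame, start, pos
                    start = pos + 3
            # Trailing stretch that runs to the end of the read without a stop.
            end = start + max(0, (len(s) - start) // 3) * 3
            if end - start >= min_length:
                yield strand, frame, start, end


# Toy read: start codon, 25 alanine codons, stop codon (81 bp).
read = "ATG" + "GCT" * 25 + "TAA"
for hit in open_frames(read):
    print(hit)
```

A single read yields several candidate frames, and one inserted or deleted base moves the true coding region into a different frame. This naive scan cannot recover from that, which is the motivation for a probabilistic model.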

18 EBI Metagenomics: annotation of coding sequences
Most available pipelines use pairwise alignment methods (such as BLAST):
• compare a query sequence with a database of sequences
• identify database sequences that resemble the query sequence with homology score above a certain threshold
However, sequences may appear to have a low homology score because:
• proteins may share homology only in limited domains
• proteins from different species can differ in length
Example: first line of BLAST alignment of 60S acidic ribosomal protein P0 from 2 closely-related species

19 Using BLAST for annotation

20 EBI Metagenomics: advantage of InterPro
The EBI Metagenomics pipeline does not use BLAST-based methods to associate functions with predicted protein sequences: instead we use InterProScan to mine the InterPro database.
The InterPro database (HMM- and profile-based functional analysis) is based on the presence of “signatures” (models) from eleven databases.
• Specificity: mapping is manually curated
  - IPR024185: 5-formyltetrahydrofolate cyclo-ligase-like
  - IPR000847: Transcription regulator HTH, LysR
• Speed: test set of 40,692 predicted protein sequences
  - BLAST vs UniRef100 = 21.5 s/cds
  - InterProScan (5 databases) = 3 s/cds
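To put the per-sequence timings in perspective, the arithmetic below scales them to the whole test set. It assumes the sequences are processed one after another on a single process, which the slide does not state, so the totals are only indicative.

```python
n_sequences = 40_692            # test set size quoted on the slide
blast_s_per_cds = 21.5          # BLAST vs UniRef100
interproscan_s_per_cds = 3.0    # InterProScan (5 databases)

# Total wall-clock time if sequences are processed serially (assumption).
blast_hours = n_sequences * blast_s_per_cds / 3600
ips_hours = n_sequences * interproscan_s_per_cds / 3600

print(f"BLAST vs UniRef100: {blast_hours:.1f} h")   # 243.0 h
print(f"InterProScan:       {ips_hours:.1f} h")     # 33.9 h
print(f"speed-up:           {blast_s_per_cds / interproscan_s_per_cds:.1f}x")  # 7.2x
```

Under that assumption the test set takes roughly ten days with BLAST and under a day and a half with InterProScan.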

21 EBI Metagenomics: InterProScan annotations
pCDS: SRR _1_1_105_-
member database: ProSitePatterns
signature accession: PS00194
signature description: Thioredoxin family active site
score: 1.0E-13
InterPro accession: IPR017937
InterPro description: Thioredoxin, conserved site
GO annotation: GO:

22 EBI Metagenomics: InterProScan annotations [Screenshot callouts: signatures, links, GO terms, description]

23 Aims of the Gene Ontology
• Allow cross-species and/or cross-database comparisons
• Unify the representation of gene and gene product attributes across species
• Controlled vocabulary

24 Inconsistency in naming of biological concepts?
English is not a very precise language:
• Same name for different concepts
• Different names for the same concept
An example: Tactition, Tactile sense and Taction all mean "Sensory perception of touch"; GO:

25 The Gene Ontology
• A way to capture biological knowledge in a written and computable form
• A set of concepts and their relationships to each other, arranged as a hierarchy (less specific concepts → more specific concepts)

26 The Concepts in GO
1. Molecular Function: an elemental activity or task or job (protein kinase activity, insulin receptor activity)
2. Biological Process: a commonly recognised series of events (cell division)
3. Cellular Component: where a gene product is located (mitochondrion, mitochondrial matrix, mitochondrial inner membrane)

27 The relationship between InterPro and GO (InterPro2GO)
• Curators manually add relevant GO terms to InterPro entries
• When a sequence is searched against InterPro, it is assigned GO terms by virtue of the entries it matches
Example: SRR _1_1_133_+; Pfam PF00005 (ABC transporter); 6 8.9E-6; IPR003439 (ABC transporter-like); GO: |GO: (ATP binding, ATPase activity)
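The propagation of GO terms from InterPro entries to sequences amounts to a dictionary lookup. The sketch below is a minimal illustration: the entry IPR003439 and its two term names come from the slide, while the sequence identifiers, the `IPR_UNMAPPED` entry and the function name are hypothetical. GO accessions are left out because the slide does not show them.

```python
from collections import Counter

# InterPro entry -> GO term names, as curated in InterPro2GO.
# Only the entry shown on the slide is included here.
INTERPRO2GO = {
    "IPR003439": ["ATP binding", "ATPase activity"],
}


def assign_go_terms(matches, interpro2go):
    """Map each sequence to the GO terms of the InterPro entries it matches.

    matches: iterable of (sequence_id, interpro_accession or None).
    """
    go_by_sequence = {}
    for seq_id, entry in matches:
        terms = go_by_sequence.setdefault(seq_id, set())
        terms.update(interpro2go.get(entry, []))
    return go_by_sequence


# Hypothetical matches: two sequences hit IPR003439, one hits an entry with
# no GO terms attached, one has no InterPro entry at all.
matches = [
    ("read_1", "IPR003439"),
    ("read_2", "IPR003439"),
    ("read_3", "IPR_UNMAPPED"),
    ("read_4", None),
]
go_by_sequence = assign_go_terms(matches, INTERPRO2GO)
go_counts = Counter(t for terms in go_by_sequence.values() for t in terms)
print(go_counts)
```

Counting terms across all sequences in this way gives the kind of per-sample GO summary that a functional analysis reports.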

28 EBI Metagenomics: overview of taxonomy analysis
processed reads → rRNASelector → reads with rRNA → Qiime → Taxonomic analysis (Amplicon-based data also enters at Qiime)

29 EBI Metagenomics: identification of suitable sequences
Taxonomy analysis is generally based on identification and classification of rRNA sequences:
• Prokaryotes (archaebacteria and eubacteria): 5S, 16S and 23S
• Eukaryotes: 5S, 5.8S, 18S and 28S
• Viruses: there is no equivalent, so analysis depends on DNA polymerase or part of the 5’-UTR (internal ribosomal entry site [IRES]) sequences
EBI Metagenomics currently only provides taxonomy analysis for Prokaryotes.
rRNA sequences are identified using rRNASelector:
• hidden Markov models to identify rRNA sequences
• 60 bp minimum overlap with well-curated HMM model
• E-value <
Lee et al (2011) J Microbiol. 49(4)

30 EBI Metagenomics: identification of suitable sequences
Once identified, rRNA sequences are clustered and classified using Qiime.
“QIIME stands for Quantitative Insights Into Microbial Ecology. QIIME is an open source software package for comparison and analysis of microbial communities”
The main steps are:
• clustering sequences into Operational Taxonomic Units (OTUs) using uclust
• picking a representative sequence set (one sequence from each OTU)
• aligning the representative sequence set using PyNAST
• assigning taxonomy to the representative sequence set
• generating output files:
  - filtering the alignment prior to tree building
  - building phylogenetic tree
  - creating OTU table
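The first two steps and the final OTU table can be sketched with a greedy, centroid-based clustering. This is a crude stand-in for uclust, not its algorithm: identity is computed position by position without alignment, the default 97% threshold is a common convention assumed here rather than a value given on the slide, and the sample names are hypothetical.

```python
from collections import Counter, defaultdict


def identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence (no alignment)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def greedy_otus(reads, threshold=0.97):
    """Assign each (sample, sequence) read to the first centroid it matches.

    Returns (centroids, otu_table), where centroids[i] is the representative
    sequence of OTU i and otu_table[i][sample] is a read count.
    """
    centroids = []
    otu_table = defaultdict(Counter)
    for sample, seq in reads:
        for i, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                break
        else:
            # No existing centroid is similar enough: start a new OTU.
            centroids.append(seq)
            i = len(centroids) - 1
        otu_table[i][sample] += 1
    return centroids, otu_table


# Toy 10-bp reads, so the threshold is lowered to 90% for the example.
reads = [
    ("soil_A", "ACGTACGTAC"),
    ("soil_A", "ACGTACGTAT"),  # one mismatch to the first read
    ("soil_B", "ACGTACGTAC"),
    ("soil_B", "TTTTGGGGCC"),
]
centroids, otu_table = greedy_otus(reads, threshold=0.9)
for i, counts in otu_table.items():
    print(f"OTU{i}", centroids[i], dict(counts))
```

Here the first read seen in each cluster doubles as the representative sequence. Taxonomy would then be assigned to each representative rather than to every read.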

31 EBI Metagenomics: validation of taxonomy analysis
Re-analysis of: Sutton et al, Appl. Environ. Microbiol (2013), 79(2):619, "Impact of Long-Term Diesel Contamination on Soil Microbial Community Structure".
[Figure: alpha diversity analysis; legend: polluted, clean, clean (outlier)]
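Alpha diversity summarises the diversity within a single sample. The slide does not say which metric was used, so the sketch below shows one common choice, the Shannon index H = -Σ p_i ln(p_i), computed from per-OTU read counts. The counts are invented for illustration and are not taken from the Sutton et al. data.

```python
import math


def shannon_index(otu_counts):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over OTUs with non-zero counts."""
    total = sum(otu_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in otu_counts if c > 0)


# Hypothetical OTU count vectors for two samples.
even_sample = [10, 10, 10, 10]    # four equally abundant OTUs
skewed_sample = [37, 1, 1, 1]     # one dominant OTU

print(shannon_index(even_sample))    # ln(4), about 1.386
print(shannon_index(skewed_sample))  # lower: diversity is dominated by one OTU
```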

32 Assembly of metagenomics data
Metagenomics:
• Not clear how to avoid assembling sequences from different species together (chimaera)
• No reference sequence to align against

33 EBI Metagenomics currently does not perform assembly
We are still able to annotate metagenomes, as shown by this re-analysis of the rumen metagenomics data of Hess et al, Science (2011) 331:463
What are the consequences?
• cannot link taxonomy information to functional annotations
• cannot currently perform viral taxonomy analysis

34 EBI Metagenomics pipeline in a nutshell
• QC:
  - trim adaptor sequences, low quality sequence ends
  - remove duplicates and short sequences
  - remove low complexity sequences
• Diversity analysis:
  - identify prokaryotic rRNA sequences (5S, 16S and 23S)
  - cluster rRNA-containing reads
  - assign taxonomy classification using Qiime
• Functional analysis:
  - predict ORFs
  - translate ORFs into peptides
  - submit to InterProScan for functional annotation
“Powerful and sophisticated alternative to BLAST-based functional metagenomic analysis”

35 EBI Metagenomics pipeline: data analysis using selected EBI and external software tools. Outline: • Submission • Philosophy • Overview of data analysis • QC steps • Overview of functional analysis • Overview of taxonomy analysis • Metagenome assembly • Result outputs • Other public pipelines

36 Current outputs of EBI Metagenomics pipeline
• QC and sequence statistics
• Diversity analysis
• Functional analysis
Available for visualisation and download.

37 Current outputs of EBI Metagenomics pipeline Access via the Sample page navigation tabs

38 EBI Metagenomics pipeline: taxonomy visualisation [Screenshot callouts: switch to bar chart, column or Krona interactive views; Google charts dynamic representation; Krona interactive representation]

39 EBI Metagenomics pipeline: functional visualisation [Screenshot callouts: Google charts dynamic representation; switch to bar chart view; links to InterPro website]

40 EBI Metagenomics pipeline: download options
• 470 MB: needs high computing power to manipulate; EBI Metagenomics takes care of it and extracts meaningful information sets
• relatively small files: can be manipulated on a laptop/desktop computer; users can filter them according to their needs

41 EBI Metagenomics pipeline: data analysis using selected EBI and external software tools. Outline: • Submission • Philosophy • Overview of data analysis • QC steps • Overview of functional analysis • Overview of taxonomy analysis • Metagenome assembly • Result outputs • Other public pipelines

42 Metagenomics data analysis
Pipeline 1: Quality control → Taxonomy analysis, Functional analysis → results 1
Pipeline 2: Quality control → Taxonomy analysis, Functional analysis → results 2
Results 1 and results 2:
• should share trends and main findings
• could differ in ratio and assignment

43 Public Metagenomics portals

44 Simplified overview of MG-RAST pipeline
Sequencer output → Quality control → Feature prediction (FragGeneScan) → Clustering (Uclust) → Similarity search (BLAT) → Abundance profiles → Community reconstruction → Metabolic reconstruction → Metabolic model

45 MG-RAST and EBI Metagenomics QC comparison
Example: Analysis of Prairie Soil Sample
Metric: MG-RAST / EBI Metagenomics
Upload: bp Count: 391,415,961 bp
Upload: Sequences Count: 946,839
Upload: Mean Sequence Length: 413 ± 125 bp / [value missing] bp
Upload: Mean GC percent: 61 ± 8 % / 61.2 %
Artificial Duplicate Reads: Sequence Count: 0 / 0
Post QC: bp Count: 391,415,961 bp / 388,670,692 bp
Post QC: Sequences Count: 946,839 / 908,602
Post QC: Mean Sequence Length: 413 ± 125 bp / [value missing] bp
Post QC: Mean GC percent: 61 ± 8 % / 57.8 %
Processed: Predicted Protein Features: 972,409 / 999,433
Processed: Predicted rRNA Features: 53
Alignment: Identified Protein Features: 510,221 / 480,560
Alignment: Identified rRNA Features: 1,069 / 1,110
Annotation: Identified Functional Categories: 442,070 / 462,475

46 MG-RAST and EBI Metagenomics functional analysis
Example: Analysis of Prairie Soil Sample
Ammonia monooxygenase: NH3 + A-H2 + O2 → NH2OH + A + H2O
MG-RAST: 28 unique hits on 8 different protein databases (what do the abundance numbers mean?)
• 12 Ammonia monooxygenase
• 2 ammonia monooxygenase family protein
• 4 Ammonia monooxygenase subunit A
• 5 Ammonia monooxygenase, putative
• 62 Putative ammonia monooxygenase
• 3 putative ammonia monooxygenase protein
• 4 putative ammonia monooxygenase subunit A
Databases: 8 KEGG, 18 eggNOG, 13 GenBank, 11 IMG, 8 PATRIC, 10 RefSeq, 12 TrEMBL, 9 SEED
13 GenBank: 1 putative ammonia monooxygenase; 3 Putative ammonia monooxygenase; 5 Ammonia monooxygenase; 1 ammonia monooxygenase family protein; 2 ammonia monooxygenase subunit A; 1 ammonia monooxygenase, putative
9 SEED: 6 putative ammonia monooxygenase; 2 Putative ammonia monooxygenase; 1 putative ammonia monooxygenase subunit A
EBI Metagenomics:
• 3 IPR003393 Ammonia monooxygenase/particulate methane monooxygenase, subunit A
• 25 IPR007820 Putative ammonia monooxygenase/protein AbrB

47 MG-RAST and EBI Metagenomics taxonomy analysis
Example: Analysis of Prairie Soil Sample, domain level of taxonomy
[Figure panels: MG-RAST; EBI Metagenomics (only Archaea/Bacteria taxonomy). Chart labels: (333 OTU), (55 categories), (15 categories), (98 categories), (3 types)]

48 Overview of CAMERA workflow

49 Integrated Microbial Genomes and Metagenomes analysis tools

50 Some other Metagenomics tools

51 Overview of MEGAN
[Diagram labels: QC ?; seq comparison and assignment; MEGAN; Functional analysis (SEED, KEGG, COG/EGGNOG); Taxonomy analysis; Comparative visualisation (abundance plots; PCA, clustering, co-occurrence); file types: rdp, biome files; csv, tsv files; blast output; SAM files; csv, tsv files]

52 Example of taxonomy analysis using MEGAN diverse single and multi-sample visualisations

53 Example of taxonomy analysis using MEGAN Comparison, PCA and co-occurrence plots

54 EBI Metagenomics pipeline: data analysis using selected EBI and external software tools. Outline: • Submission • Philosophy • Overview of data analysis • QC steps • Overview of functional analysis • Overview of taxonomy analysis • Metagenome assembly • Result outputs • Other public pipelines
