
Introduction to the CGE servers


1 Introduction to the CGE servers

2 Center for Genomic Epidemiology
Aim: To provide the scientific foundation for future internet-based solutions, where a central database will enable simplification of total genome sequence information and comparison to all other sequenced isolates including spatial-temporal analysis. To develop algorithms for rapid analyses of whole genome DNA-sequences, tools for analyses and extraction of information from the sequence data and internet/web-interfaces for using the tools in the global scientific and medical community. 

3

4 Tools for species identification
Name of Service | Description | URL (cge.cbs.dtu.dk/services/) | Status | Publication
SpeciesFinder | Species identification using 16S rRNA | | Online | Published Feb 2014, PMID:
KmerFinder | Species identification using overlapping 16mers | | | Published Jan 2014, PMID:
TaxonomyFinder | Taxonomy identification using functional protein domains | | | Published in Oksana's PhD thesis
Reads2Type | Species identification on client computer | | |

5 Benchmarking of Methods for Bacterial Species Identification
PMID:

6 Training data and evaluation data
Training data: 1,647 completed / almost completed genomes downloaded from NCBI in 2011 (1,009 different species)
Evaluation data:
NCBI draft genomes: 695 isolates from species that overlap with the training set (151 species)
SRA draft genomes: 10,407 sets of short reads from Illumina (168 species), and 10,407 draft genomes assembled from the Illumina data (168 species)

7 16S rRNA
16S rRNA sequencing has dominated molecular taxonomy of prokaryotes for more than 30 years (Fox et al, Int. J. Syst. Bacteriol., 1977).
Tremendous amounts of 16S rRNA sequence data are available in databases.
Concerns:
Low resolution.
Some genomes contain several copies of the 16S rRNA gene with inter-gene variation.
The 16S rRNA gene represents only about 0.1% of the coding part of a microbial genome.

8 CGE implementation of 16S species identification SpeciesFinder
Reference database: 16S rRNA genes are isolated from the genomes in the training data using RNAmmer (Lagesen, NAR, 2007). RNAmmer is based on a hidden Markov model; BLAST alone will not work for isolating the 16S rRNA gene.
Method: Input genomes are BLASTed against the 16S rRNA genes in the reference database. The best hit is selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments.
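The slide does not say how the five criteria are combined, so the following is only a minimal sketch under one assumed rule: compare hits lexicographically, preferring higher coverage, then higher identity, then higher bitscore, then fewer mismatches, then fewer gaps. The `Hit` class and `best_hit` function are hypothetical names used for illustration and are not part of SpeciesFinder.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST alignment against a reference 16S rRNA gene (illustrative fields)."""
    subject: str       # reference entry the query aligned to
    coverage: float    # percent of the reference gene covered
    identity: float    # percent identity of the alignment
    bitscore: float
    mismatches: int
    gaps: int


def best_hit(hits):
    """Return the best hit under an ASSUMED lexicographic ranking, or None."""
    if not hits:
        return None
    # Higher is better for the first three fields, lower for the last two.
    return max(
        hits,
        key=lambda h: (h.coverage, h.identity, h.bitscore, -h.mismatches, -h.gaps),
    )


example = [
    Hit("ref_a", 99.0, 98.5, 2000.0, 3, 1),
    Hit("ref_b", 100.0, 97.0, 1900.0, 5, 0),
]
print(best_hit(example).subject)
```

A weighted score would be an equally plausible reading of "combination"; the lexicographic key is used here only because it is easy to inspect.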

9 KmerFinder
Genomes in the training data are chopped into overlapping 16mers.
Example sequence: A T G A C G T A T G A T T G A T G A C G T A G T A G T C C
Immune-system-inspired downsampling: only 16mers with a specific prefix are kept. ATGA is the prefix used in this example.
A database is generated where each 16mer is a key and the value is a list of all the isolates in the training set containing this 16mer.
Background (MHC-I): The human allele HLA-A*02:01 prefers leucine at position 2 and valine at position 9. If all amino acids were equally frequent, restriction by this motif would make it bind 1 out of 400 peptides. Fixing the 2 anchor positions of a 9mer peptide still leaves the peptide as selective as any other 9mer peptide, but only a fraction of the 9mer peptides have the right anchors. As a result of this binding specificity, the immune system does not recognize the entire proteome of a microbe but only a subset of it. The "microbe database" that is actually remembered by the immune system can be ~200 times smaller than it would otherwise be. This may be important, since there are only ~10^12 cells in the immune system, and without a reduction in the microbe database the number of cells may be insufficient to save information about all the microbes we encounter throughout our lives.
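As a rough illustration of the database described on this slide, the sketch below chops each genome into overlapping 16mers, keeps only those starting with the prefix ATGA, and stores the isolates containing each kept 16mer. The function names and the toy sequences are hypothetical; details such as reverse-complement handling are not covered by the slide and are left out.

```python
from collections import defaultdict


def kmers(seq, k=16):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(genomes, k=16, prefix="ATGA"):
    """Map each prefix-selected k-mer to the set of isolates that contain it.

    genomes: dict of isolate name -> nucleotide sequence (toy input format).
    """
    index = defaultdict(set)
    for isolate, seq in genomes.items():
        for kmer in kmers(seq.upper(), k):
            if kmer.startswith(prefix):  # downsampling step: drop all other k-mers
                index[kmer].add(isolate)
    return index


# Toy sequences, made up for illustration only.
toy = {
    "iso1": "ATGAATGTGTGAGTGACC",
    "iso2": "CCATGAATGTGTGAGTGA",
}
index = build_index(toy)
for kmer, isolates in index.items():
    print(kmer, sorted(isolates))
```

With a 4-letter prefix, roughly 1 in 256 of the 16mers in a random sequence is kept, which is the same order of reduction as the ~200-fold figure quoted for the immune system analogy.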

10 16mer database
[Figure: 16mer database. The example keys ATGAATGTGTGAGTGA and ATGACTGTGCCCCTGA each map to a list of isolates, including CP001921 and CP000521 (Acinetobacter baumannii) and an isolate of Buchnera aphidicola.]
Unique 16mers of the unknown isolate: ATGAATGTGTGAGTGA, ATGACTGTGCCCCTGA, ATGAAAAAAAAAAAA
Species | Match | No. of Kmer hits
Acinetobacter baumannii | CP001921 | 2
 | CP000521 | 1
 | CP002521 |
Buchnera aphidicola | CP002301 |
Very robust method: it needs just one 16mer to make a prediction.
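The lookup step can be sketched as follows: every unique 16mer of the unknown isolate is looked up in the database, and each isolate listed under a matching key gets one hit. The isolate names `A1`, `A2` and `B1` and their assignment to keys are made up, since the transcript does not preserve which accession sits under which 16mer; `count_hits` is a hypothetical helper, not KmerFinder's code.

```python
from collections import Counter


def count_hits(query_kmers, index):
    """Count, per training isolate, how many unique query k-mers it shares."""
    hits = Counter()
    for kmer in set(query_kmers):           # each unique k-mer counts once
        for isolate in index.get(kmer, ()):  # k-mers absent from the database add nothing
            hits[isolate] += 1
    return hits


# Toy database with hypothetical isolate names.
index = {
    "ATGAATGTGTGAGTGA": {"A1", "A2"},
    "ATGACTGTGCCCCTGA": {"A1", "B1"},
}
query = ["ATGAATGTGTGAGTGA", "ATGACTGTGCCCCTGA", "ATGAAAAAAAAAAAA"]

hits = count_hits(query, index)
best, n = hits.most_common(1)[0]
print(best, n)
```

Because a single shared 16mer already produces a non-empty hit table, the method can return a prediction even for very fragmented input, which is the robustness the slide refers to.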

11 KmerFinder is very robust – it only needs one 16mer!
Desulfovibrio piger GOR1 (SRR097356)
>NODE 4 length 92 cov
TAGGACGTGGAATATGGCAAGAAAACTGAAAATCATGGAAAATGAGAAACATCCACTTGA
CGACTTGAAAAATGACGAAATCACTAAAAAACGTGAAAAATGAGAAATGC
>NODE 15 length 82 cov
AGCGAAAAATGTCATAACAACGATCACGACCGATAACCATCTTTGGTCCAAACTTACTCA
CGCAGCAGGCGTATAACTCGCGCATACCAGCTTTGGGCAT
N50 = 110; total no. of bp: 210
Prediction: Flavobacterium psychrophilum, match AM398681, no. of Kmer hits: 1
For 41 isolates, the method failed to produce an output. The 41 draft genomes typically had an N50 below 200 (average N50 = 155) and total no. of bp below
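For readers unfamiliar with N50, the figure quoted for this two-contig assembly can be reproduced with the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name `n50` is just an illustrative choice.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:  # reached at least half of the assembly
            return length
    raise ValueError("no contigs given")


# The two contigs shown on the slide are 110 bp and 100 bp long.
print(n50([110, 100]))  # 110, matching the slide (total 210 bp)
```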

12 TaxonomyFinder

13 Reads2Type
Definition: quick and dirty taxonomy identification of single isolates.
Database of 50-mers of marker genes:
16S rRNA: training data genomes -> RNAmmer (other)
ITS: training data (Mycobacterium)
GyrB: training data (Enterobacteriaceae)
Resulting database: ~5 MB.
Reads2Type pushes the analysis to the user; the server only provides the 50-mer database.
SuffixTree: efficient data structure for string matching.
Narrow Down Approach: Reads2Type compares 50-mers of the combined marker genes against the raw reads.
Shared probes vs. unique probes.
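The slide only names the ingredients (50-mer probes, shared vs. unique probes, a narrow-down step), so the sketch below is one possible reading, not the Reads2Type implementation: a probe found in a single species is treated as unique and settles the answer, while a shared probe only narrows the set of candidate species. A plain dictionary lookup stands in for the suffix tree, and the toy example uses 4-mers instead of 50-mers to stay readable. All names are hypothetical.

```python
from collections import defaultdict


def probes(seq, k):
    """Return the set of all overlapping k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify_probes(markers, k=50):
    """Split marker-gene probes into unique (one species) and shared (several)."""
    owners = defaultdict(set)
    for species, seq in markers.items():
        for p in probes(seq, k):
            owners[p].add(species)
    unique = {p: next(iter(s)) for p, s in owners.items() if len(s) == 1}
    shared = {p: s for p, s in owners.items() if len(s) > 1}
    return unique, shared


def narrow_down(reads, unique, shared, k=50):
    """ASSUMED narrow-down rule: stop at a unique probe, else intersect candidates."""
    candidates = None
    for read in reads:
        for p in probes(read, k):
            if p in unique:
                return {unique[p]}
            if p in shared:
                candidates = set(shared[p]) if candidates is None else candidates & shared[p]
    return candidates if candidates else set()


# Toy marker genes, made up for illustration.
markers = {"sp1": "AAAACCCC", "sp2": "AAAAGGGG"}
unique, shared = classify_probes(markers, k=4)
print(narrow_down(["TTAAAATT"], unique, shared, k=4))  # only a shared probe matches
print(narrow_down(["AAAACC"], unique, shared, k=4))    # a unique probe matches
```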

14 rMLST
Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM, Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, Maiden MC. Ribosomal multilocus sequence typing: universal characterization of bacteria from domain to strain. Microbiology. 2012 Apr;158(Pt 4):
CGE implementation:
For each genome in the training data, the 53 ribosomal genes were extracted.
Genomes in the evaluation sets were aligned to each gene collection using BLAT (only hits with at least 95% identity and 95% coverage were considered as a potential match).
The closest match among the training genomes was selected based on a combination of coverage, % identity, bitscore, number of mismatches and number of gaps in the alignments across all genes.
Average N50 of 1329 for failed isolates.
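A minimal sketch of the filtering and selection step, under stated assumptions: hits below 95% identity or 95% coverage are discarded, and the surviving hits are aggregated per training genome. The aggregation rule used here (most genes matched, then highest summed bitscore) is an assumption, since the slide only says the criteria were combined across all genes. The tuple layout and the function name are hypothetical.

```python
from collections import defaultdict


def closest_genome(hits, min_identity=95.0, min_coverage=95.0):
    """Pick the training genome with the best aggregate match, or None.

    hits: iterable of (genome, gene, identity, coverage, bitscore) tuples.
    """
    totals = defaultdict(lambda: [0, 0.0])  # genome -> [genes matched, summed bitscore]
    for genome, gene, identity, coverage, bitscore in hits:
        if identity >= min_identity and coverage >= min_coverage:
            totals[genome][0] += 1
            totals[genome][1] += bitscore
    if not totals:
        return None  # no hit passed the thresholds: no prediction
    return max(totals, key=lambda g: tuple(totals[g]))


# Made-up alignment results for two ribosomal genes.
hits = [
    ("genome_A", "rpsA", 99.0, 100.0, 900.0),
    ("genome_A", "rpsB", 96.0, 97.0, 400.0),
    ("genome_B", "rpsA", 99.5, 100.0, 950.0),
    ("genome_B", "rpsB", 94.0, 100.0, 380.0),  # fails the identity threshold
]
print(closest_genome(hits))
```

Returning `None` when nothing passes the thresholds corresponds to the isolates for which the method makes no prediction, which the slide associates with fragmented assemblies.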

15 Results (16S rRNA)
On the SRA drafts set, rMLST is unable to make a prediction for 3.5% of the isolates, TaxonomyFinder for 1.8%, KmerFinder for 0.4%, and SpeciesFinder for 0.2%.

16 Overlap in predictions
One of the six isolates that all methods agree are not correctly annotated has actually been re-annotated since we downloaded the data.

17 Isolates in the NCBIdrafts set for which all four methods predict the species to be different from the annotated one. * NZAEPO has been re-annotated as S. oralis since we downloaded the data.

18 All four methods agree that 2 of the B. cereus isolates are B. weihenstephanensis

19 Bacillus cereus predicted to be B. thuringiensis is problematic
The same applies to other recently diverged species: Y. pestis <> Y. pseudotuberculosis, M. tuberculosis <> M. bovis.

20 Speed
Method | Estimated speed (mm:ss)
16S | 00:13*
KmerFinder | 00:09*
TaxonomyFinder | 11:33*
rMLST | 00:45*
Reads2Type | 00:55**
*Estimation based on draft genomes
**Estimation based on short reads

21 Summary of taxonomy benchmark study
KmerFinder had the highest accuracy and was the fastest method.
SpeciesFinder (16S rRNA-based) had the lowest accuracy.
Methods that only sample genomic loci (16S, Reads2Type, rMLST) had difficulties distinguishing species that only recently diverged, especially when the main difference is a plasmid.
Recently diverged: Y. pestis <> Y. pseudotuberculosis, M. tuberculosis <> M. bovis.

22

23

24 Tools for further typing
Name of Service | Description | URL | Publication
MLST | Multilocus sequence typing | | Published Apr 2012, PMID:
PlasmidFinder | Identification of plasmids in Enterobacteriaceae | PlasmidFinder | Published Apr 2014, PMID:
pMLST | pMLST of plasmids in Enterobacteriaceae | |

25 Multilocus Sequence Typing (MLST)
First developed in 1998 for Neisseria meningitidis (Maiden et al., PNAS).
The nucleotide sequences of internal regions of approximately 7 housekeeping genes are determined by PCR followed by Sanger sequencing.
Each distinct allele is assigned an arbitrary number.
The unique combination of alleles is the sequence type (ST).
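The relationship between allele numbers and sequence type amounts to a table lookup, which can be sketched as below. The locus names, allele numbers and ST values are all invented for illustration and do not come from any real MLST scheme.

```python
# Hypothetical 7-locus scheme; real schemes define their own loci.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")

# Hypothetical profile table: allele combination (in LOCI order) -> sequence type.
ST_TABLE = {
    (1, 1, 2, 1, 3, 1, 1): 10,
    (1, 1, 2, 1, 3, 1, 4): 11,
}


def sequence_type(alleles, st_table=ST_TABLE, loci=LOCI):
    """Return the ST for a dict of locus -> allele number, or None if unknown."""
    profile = tuple(alleles[locus] for locus in loci)
    return st_table.get(profile)


isolate = dict(zip(LOCI, (1, 1, 2, 1, 3, 1, 4)))
print(sequence_type(isolate))  # 11

novel = dict(zip(LOCI, (9, 1, 2, 1, 3, 1, 4)))
print(sequence_type(novel))  # None: combination not in the table
```

Changing a single allele gives a different ST (or none at all), which is why the allele combination, not any single locus, defines the type.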

26 Using WGS data for MLST

27 www.cbs.dtu.dk/services/MLST
MLST schemes listed: Acinetobacter baumannii #1, Arcobacter, Borrelia burgdorferi, Bacillus cereus, Brachyspira hyodysenteriae, Bifidobacterium, Brachyspira intermedia, Bordetella, Burkholderia pseudomallei, Brachyspira, Burkholderia cepacia complex, Campylobacter jejuni, Clostridium botulinum, Clostridium difficile #1, Clostridium difficile #2, Campylobacter helveticus, Campylobacter insulaenigrae, Clostridium septicum, C. diphtheriae, Campylobacter fetus, Chlamydiales, Campylobacter lari, Cronobacter, C. upsaliensis, Escherichia coli #1, Escherichia coli #2, Enterococcus faecalis, Enterococcus faecium, F. psychrophilum, Haemophilus influenzae, Haemophilus parasuis, Helicobacter pylori, Klebsiella pneumoniae, Lactobacillus casei, Lactococcus lactis, Leptospira, Listeria, Listeria monocytogenes, Moraxella catarrhalis, Mannheimia haemolytica, Neisseria, P. gingivalis, P. acnes, Pseudomonas aeruginosa, Pasteurella multocida, Staphylococcus aureus, Streptococcus agalactiae, Salmonella enterica, Staphylococcus epidermidis, S. maltophilia, Streptococcus pneumoniae, Streptococcus oralis, S. zooepidemicus, Streptococcus pyogenes, Streptococcus suis, Streptococcus thermophilus, Streptomyces, Streptococcus uberis, Vibrio parahaemolyticus, Vibrio vulnificus, Wolbachia, Xylella fastidiosa, Y. pseudotuberculosis
Input types listed: Assembled genome; 454, single end reads; 454, paired end reads; Illumina, single end reads; Illumina, paired end reads; Ion Torrent; SOLiD, single end reads; SOLiD, mate pair reads

28

29 Extended Output

30 Extended Output
aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, aro_122 is the best match for aro
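To make the numbers in this output line concrete, the sketch below parses it and computes how much of the allele the alignment covers. The regular expressions are written against this one example line only; the real output format may differ. Reading the WARNING as a consequence of the alignment covering only part of the allele is an assumption, not something the slide states.

```python
import re

LINE = ("aro: WARNING, Identity: 100%, HSP/Length: 349/498, Gaps: 0, "
        "aro_122 is the best match for aro")


def parse_extended(line):
    """Extract identity, HSP length, allele length and gaps from one output line."""
    identity = float(re.search(r"Identity:\s*([\d.]+)%", line).group(1))
    hsp, length = map(int, re.search(r"HSP/Length:\s*(\d+)/(\d+)", line).groups())
    gaps = int(re.search(r"Gaps:\s*(\d+)", line).group(1))
    return {
        "identity": identity,
        "hsp": hsp,
        "length": length,
        "gaps": gaps,
        "coverage": 100.0 * hsp / length,  # percent of the allele that is aligned
    }


result = parse_extended(LINE)
print(f"identity {result['identity']:.0f}%, coverage {result['coverage']:.1f}%")
```

Here the aligned part is identical to allele aro_122 but spans only 349 of its 498 bases, so the match cannot be confirmed over the full allele length.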

31 What is the MLST web-service used for?
Our most used service. In the first 9 months of 2014, it was used on average more than 1,500 times per month. From Sep to Oct 2014, the service was used more than 20,000 times in total.

32 PlasmidFinder and pMLST
The PlasmidFinder database contains replicons, not entire plasmids.

33 Tools for phenotyping
Name of Service | Description | URL (https://cge.cbs.dtu.dk/services/) | Publication
ResFinder | Identification of acquired antibiotic resistance genes | | Published Nov 2012, PMID:
VirulenceFinder | Identification of virulence genes in E. coli (and S. aureus and Enterococcus) | VirulenceFinder | E. coli published Feb 2014, PMID:
MyDbFinder | Identification of genes from the user's own database | | Will be published in book chapter
PathogenFinder | Prediction of pathogenic potential | PathogenFinder | Published Oct 2013, PMID:

34 ResFinder
[Figure: ResFinder workflow. Input is either NGS reads (Illumina, Ion Torrent, 454, ...) passed through the assembly pipeline, or Sanger sequences in FASTA format. ResFinder (BLAST) produces a resistance gene profile (list of genes, accession numbers) and a theoretical resistance phenotype.]

35

36 From S. aureus

37 ResFinder, 98% ID, 60% length coverage
200 isolates from 4 different species (Salmonella Typhimurium, Escherichia coli, Enterococcus faecalis and Enterococcus faecium)
Phenotypic tests: 3,051 in total (482 resistant, 2,569 susceptible)
=> 99.74% of the results were in agreement between ResFinder and the phenotypic tests
23 discrepancies -> 16, typically in relation to spectinomycin in E. coli

38 Alternatives to ResFinder

39 Unpublished or uncategorized
Name of Service | Description | URL | Status | Publication
PanFunPro | Groups homologous proteins based on functional domain content | | Online | Published in F1000Research 2013, 2:265
SerotypeFinder | Identification of serotypes | SerotypeFinder-1.0 | | Not yet published
Restriction-ModificationFinder | Identification of RM system genes | | | Will only be published in book chapter
HostPhinder | Prediction of the host of a bacteriophage | | Online, but under development |
MetaVirFinder | Identification of viruses in metagenomic data | MetaVirFinder | |
MGmapper | Identifies the content of metagenomic samples | | |

40 Tools for phylogeny
Name of Service | Description | URL (cge.cbs.dtu.dk/services) | Status | Publication
SnpTree | Creation of phylogenetic trees based on SNPs | snpTree | Online | Published Dec 2012, PMID:
CSIPhylogeny | | CSIPhylogeny | | Planned
NDtree | Creation of phylogenetic trees | | | Published in Feb 2014, PMID:

41 Web-service usage

42 Type of data uploaded to MLST web-service

43

