
EBI is an Outstation of the European Molecular Biology Laboratory. Bioinformatics Challenges in Data Handling and Presentation to the Bioinformaticists.


1 EBI is an Outstation of the European Molecular Biology Laboratory. Bioinformatics Challenges in Data Handling and Presentation to the Bioinformaticists

2 Metagenomic nucleotide sequence and annotation: Range of environments. Global ocean survey; human faecal virus communities; human distal gut microbiome; phosphorus removal sludge communities; obesity-associated gut microbiome; acidophilic bacterial community; mouse gut flora.

3 Metagenomic nucleotide sequence and annotation: Data growth: projects

4 Metagenomic nucleotide sequence and annotation: Data growth: volume of dataset

5 Metagenomic nucleotide sequence and annotation: Assembly issues. Most metagenome records have not been assembled into scaffolds in INSDC records (only 4 of 24 projects so far) and remain as unassembled WGS records. Those that have been assembled into scaffolds show very limited assembly: of the four assembled projects, one contains almost as many scaffolds as contigs.

6 Metagenomic nucleotide sequence and annotation: Metadata issues. Metadata, particularly sampling information, are often not shown, or are provided with limited granularity, restricting re-analysis by users. INSDC offers appropriate structures for such metadata, but they are frequently not used, even when the information is available to the submitters.

Current:

    FT   source          1..2866
    FT                   /organism="marine metagenome"
    FT                   /environmental_sample
    FT                   /mol_type="genomic DNA"
    FT                   /isolation_source="isolated as part of a large dataset
    FT                   composed predominantly from surface water marine samples
    FT                   collected along a voyage from Eastern North American coast
    FT                   to the Eastern Pacific Ocean, including locations in the
    FT                   Sargasso Sea, Panama Canal, and the Galapagos Islands"
    FT                   /note="metagenomic"
    FT                   /db_xref="taxon:408172"

Could be:

    FT   source          1..2866
    FT                   /organism="marine metagenome"
    FT                   /environmental_sample
    FT                   /mol_type="genomic DNA"
    FT                   /country="French Polynesia: Moorea, Cooks Bay"
    FT                   /lat_lon="17.476 S 149.81 W"
    FT                   /isolation_source="marine surface water; sample
    FT                   depth: 34M; size range: 0.1-0.8 microns; water
    FT                   temperature: 28.900; salinity: 35.100"
    FT                   /db_xref="taxon:408172"
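The value of the structured "Could be" record above is that its qualifiers can be read mechanically. The following is a minimal sketch of that idea, not an INSDC or EBI tool: the function names, the "habitat" key, and the "key: value; key: value" convention for /isolation_source are assumptions taken from this one example record.

```python
import re

def parse_source_qualifiers(ft_lines):
    """Collect /key="value" and bare /flag qualifiers from EMBL-style FT lines."""
    # Drop the leading "FT" tag and join continuation lines with single spaces.
    text = " ".join(line[2:].strip() for line in ft_lines if line.startswith("FT"))
    qualifiers = {}
    for match in re.finditer(r'/(\w+)(?:="([^"]*)")?', text):
        key, value = match.group(1), match.group(2)
        # Qualifiers without a value (e.g. /environmental_sample) are stored as True.
        qualifiers[key] = value if value is not None else True
    return qualifiers

def parse_lat_lon(value):
    """Turn '17.476 S 149.81 W' into signed decimal degrees (south and west negative)."""
    lat, ns, lon, ew = value.split()
    return (float(lat) * (-1 if ns == "S" else 1),
            float(lon) * (-1 if ew == "W" else 1))

def parse_isolation_source(value):
    """Split 'habitat; key: value; ...' into a dict (format assumed from the example)."""
    fields = {}
    for part in value.split(";"):
        part = part.strip()
        if ":" in part:
            key, val = part.split(":", 1)
            fields[key.strip()] = val.strip()
        elif part:
            fields["habitat"] = part  # hypothetical key for the free-text lead item
    return fields

# The "Could be" source feature from the slide.
record = [
    'FT   source          1..2866',
    'FT                   /organism="marine metagenome"',
    'FT                   /environmental_sample',
    'FT                   /mol_type="genomic DNA"',
    'FT                   /country="French Polynesia: Moorea, Cooks Bay"',
    'FT                   /lat_lon="17.476 S 149.81 W"',
    'FT                   /isolation_source="marine surface water; sample',
    'FT                   depth: 34M; size range: 0.1-0.8 microns; water',
    'FT                   temperature: 28.900; salinity: 35.100"',
    'FT                   /db_xref="taxon:408172"',
]

qualifiers = parse_source_qualifiers(record)
position = parse_lat_lon(qualifiers["lat_lon"])
sample = parse_isolation_source(qualifiers["isolation_source"])
```

With the "Current" record, the same code would return only a free-text /isolation_source and no coordinates, which is the re-analysis problem the slide describes.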

7 Metagenomic nucleotide sequence and annotation: Taxonomy issues. Taxonomic annotation in metagenomic data is simplistic: a very small number of non-specific taxa are necessarily used to describe all of the raw data. Analysis methodology, particularly binning, is inconsistent across the dataset, so taxonomic assertions in assembled sequence are of uncertain provenance. Standards on whether or not single contigs should contribute to scaffolds for more than one taxon are yet to be established.

8 Metagenomes and UniProt (1/2). As of this month, ~6 million protein sequences from the Global Ocean Survey (GOS) have been released (vs. 4,534,260 UniProtKB entries). Future exponential increase is anticipated: the growth of public protein sequence data is exponential, with a doubling time of about 20 months; metagenomics data will have a substantially shorter doubling time; GOS data will more than double the existing protein-coding sequences in UniProtKB.
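To make the "doubling time of about 20 months" concrete, here is a small sketch of an idealized exponential growth model, N(t) = N0 * 2^(t / T). It is an illustration only: the function names are hypothetical, the 20-month figure and the 4,534,260 starting count come from the slide, and real database growth will not follow the curve exactly.

```python
import math

UNIPROTKB_ENTRIES = 4_534_260  # UniProtKB entry count quoted on the slide

def projected_size(initial, months, doubling_months=20.0):
    """Size after `months` under pure exponential growth with the given doubling time."""
    return initial * 2 ** (months / doubling_months)

def months_to_reach(initial, target, doubling_months=20.0):
    """Months needed to grow from `initial` to `target` under the same model."""
    return doubling_months * math.log2(target / initial)

after_one_doubling = projected_size(UNIPROTKB_ENTRIES, 20)
months_to_double = months_to_reach(UNIPROTKB_ENTRIES, 2 * UNIPROTKB_ENTRIES)
```

Passing a smaller `doubling_months` models the slide's expectation that metagenomics data will double faster than the general protein sequence data.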

9 Metagenomes and UniProt (2/2). Perspectives: vast amount of sequence data; environmental context in metadata; new kind of data requires new storage, processing, and data mining procedures; taxonomically unassigned data will not be included in the UniProt Knowledgebase; UniMES: UniProt Metagenomics and Environmental sequences (June 2007).

10 UniMes requirements. Distinct storage and dissemination: separated from current UniProt databases. Distinct production pipeline. Distinct accession number range: MES followed by 11 hexadecimal digits, e.g. MES00000000001. Distinct data mining pipelines: less restrictive rules, due to the lack of basic knowledge about the taxonomic origin of these sequences.
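The accession format described above ("MES" followed by 11 hexadecimal digits) can be checked with a short pattern. This is a sketch under assumptions the slide does not confirm: that the hexadecimal digits are upper case, and that accessions are zero-padded counters. The function names are hypothetical.

```python
import re

# "MES" followed by exactly 11 hexadecimal digits (upper case assumed).
UNIMES_ACCESSION = re.compile(r"MES[0-9A-F]{11}")

def is_unimes_accession(accession):
    """Return True if the whole string matches the assumed UniMES accession format."""
    return UNIMES_ACCESSION.fullmatch(accession) is not None

def format_unimes_accession(number):
    """Render an integer as a zero-padded, 11-digit hexadecimal accession."""
    if not 0 <= number < 16 ** 11:
        raise ValueError("number does not fit in 11 hexadecimal digits")
    return f"MES{number:011X}"

first = format_unimes_accession(1)  # the example accession from the slide
```

Eleven hexadecimal digits give 16^11 (about 1.76 × 10^13) possible accessions.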

11 UniMes pipeline overview [Diagram labels: EMBL; Primary data; Genomic sequence (EMBL); Other submissions; Metagenomics data (WGS); UniProt Knowledgebase; UniProt Metagenomics; UniProt Archive; Classification; Clustering; Automatic annotation rules; Secondary analysis (twice); DNA Metagenomics (to be established)]

12 UniProtKB vs. UniMes database growth

13 UniMes storage growth

14 UniMes hardware requirements (1/2). 2 HP/Compaq AlphaServer ES45s with 4 1250 MHz CPUs and 12 GB memory; Oracle database designed to store and maintain data derived from EMBL; Oracle Warehouse for data analysis, integration and display; 64-bit Linux farm (AMD Opteron) using 40 nodes for data mining procedures.

15 UniMes hardware requirements (2/2). New Oracle servers: Sun Fire V490 with 4 1500 MHz UltraSPARC IV CPUs and 16 GB memory. We have enough physical storage and CPU power for 2007.

16 UniMes dissemination. FASTA and XML files; UniProt web site: text and similarity searches.

17 GOS submission. Submission of nucleic acid sequence data to EMBL/GenBank/DDBJ is mandatory for publication of a scientific paper. The Craig Venter Institute submitted to EMBL/GenBank/DDBJ in March 2007. Environmental metadata can only be found on the CAMERA website. Metadata are of great importance for metagenomic sequence data: descriptions of sampling sites and habitats; analysis of metagenomics sequence data. URGENT need for the community to agree on what metadata must be included with the submission of any metagenomics sample.

18 UniMes and GOS data


21 [Figure panels: Top 10 InterPro entries hitting UniProt; Top 10 InterPro entries hitting GOS; Top 10 InterPro entries hitting UniParc (including GOS)]

22 UniMes and GOS data: Analysis. Calculation time: 763,425 CPU hours. Storage for InterPro hits to GOS: 50 GB.
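To give a sense of scale for the calculation time above, the sketch below converts CPU hours into wall-clock days. It assumes perfect parallel scaling and one fully used CPU per worker, neither of which the slides state; the worker count of 40 is borrowed from the farm size mentioned on the hardware slide and is only an illustrative choice. The function name is hypothetical.

```python
CPU_HOURS = 763_425  # calculation time quoted on the slide

def wall_clock_days(cpu_hours, workers):
    """Idealized wall-clock days: CPU hours spread evenly over `workers` CPUs."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return cpu_hours / workers / 24

# Assumption: 40 workers, one CPU each, no scheduling or I/O overhead.
days_on_40 = wall_clock_days(CPU_HOURS, 40)
days_on_1 = wall_clock_days(CPU_HOURS, 1)
```

Under these assumptions the run takes roughly 795 days on 40 CPUs, and about 87 years on a single CPU.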

