1 Standardized Metadata for Human Pathogen/Vector Genomic Sequences
Richard H. Scheuermann, Ph.D., Director of Informatics, J. Craig Venter Institute
On behalf of the GSC-BRC Metadata Working Group

2 Genome Sequencing Centers for Infectious Diseases (GSCID) and Bioinformatics Resource Centers (BRC)

3 High Throughput Sequencing
- Enabling technology
  - Epidemiology of outbreaks
  - Pathogen evolution
  - Host range restriction
  - Genetic determinants of virulence and pathogenicity
- Metadata requirements
  - Temporal-spatial information about isolates
  - Selective pressures
  - Host species of specimen source
  - Disease severity and clinical manifestations

4 Metadata Submission Spreadsheets
[Figure: submission spreadsheets with numbered callouts 1 to 4; image not included in the transcript]

5 Complex Query Interface

6 Metadata Inconsistencies
- Each project was providing different types of metadata
- No consistent nomenclature being used
- Impossible to perform reliable comparative genomics analysis
- Required extensive custom bioinformatics system development

7 GSC-BRC Metadata Standards Working Group
- NIAID assembled a group of representatives from their three Genome Sequencing Centers for Infectious Diseases (Broad, JCVI, UMD) and five Bioinformatics Resource Centers (EuPathDB, IRD, PATRIC, VectorBase, ViPR) programs
- Develop an approach for capturing standardized metadata for pathogen isolate sequencing projects
- Bottom up approach to capture data considered to be important by users
- Compatible with data standards and submission requirements

8 Metadata Standardization Process
- Collect example metadata sets from sequencing project white papers and other project sources (e.g. CEIRS)
- Identify data fields that appear to be common across projects and samples (core) and data fields that appear to be pathogen or project specific
- For each data field, provide a common set of attributes, including preferred term, definition, synonyms, allowed value sets (preferably using controlled vocabularies), expected syntax, etc.
- Assemble all metadata fields into a semantic network based on the Ontology for Biomedical Investigations (OBI)
- Compare, map, and harmonize to other relevant initiatives, including Genomic Standards Consortium MIxS and NCBI BioProject/BioSample
- Draft data submission spreadsheets
- Beta test version 1.0 standard with new GSCID white paper projects, collecting feedback
- Adopt version 1.1 metadata standard and data submission spreadsheets for all GSCID white paper and BRC-associated projects
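The per-field attributes named in the process above (preferred term, definition, synonyms, allowed value set, expected syntax) could be modeled roughly as follows. This is a minimal sketch, not the working group's implementation: the `MetadataField` class, the definitions, the synonyms, the allowed values and the date pattern are all hypothetical and chosen only for illustration. The field IDs and preferred terms are taken from the core sample table.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class MetadataField:
    """One metadata field with the attributes listed on the slide."""
    field_id: str
    preferred_term: str
    definition: str
    synonyms: Tuple[str, ...] = ()
    allowed_values: Optional[FrozenSet[str]] = None  # controlled vocabulary, if any
    syntax: Optional[str] = None                     # regular expression, if any

    def matches_name(self, name: str) -> bool:
        """True if a submitted column name is the preferred term or a synonym."""
        known = {self.preferred_term.lower(), *(s.lower() for s in self.synonyms)}
        return name.strip().lower() in known

    def validate(self, value: str) -> bool:
        """True if a value satisfies the value set and syntax constraints."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.syntax is not None and re.fullmatch(self.syntax, value) is None:
            return False
        return True


# Hypothetical attribute values, for illustration only.
GENDER = MetadataField(
    field_id="CS4",
    preferred_term="Specimen Source Gender",
    definition="Sex of the organism the specimen was taken from (assumed wording).",
    synonyms=("host sex", "gender"),
    allowed_values=frozenset({"male", "female", "unknown"}),
)

COLLECTION_DATE = MetadataField(
    field_id="CS8",
    preferred_term="Specimen Collection Date",
    definition="Date the specimen was collected (assumed wording).",
    syntax=r"\d{4}(-\d{2}(-\d{2})?)?",  # assumed: YYYY, YYYY-MM or YYYY-MM-DD
)
```

With definitions like these, a submitted column headed "Host Sex" resolves to CS4, and a value such as "M" is rejected because it is outside the assumed vocabulary.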

9 Core Project Metadata
(Slide columns: Field ID, Metadata Field Descriptor, OBO Foundry ID, BioProject/BioSample, MIxS. No OBO Foundry IDs appear in the transcript; the BioProject/BioSample and MIxS mappings are shown in one column because their boundary is not recoverable.)
Field ID | Metadata Field Descriptor | Mapped term (BioProject/BioSample, MIxS)
CP1 | Project Title | name
CP2 | Project ID | -
CP3 | Project Description | -
CP4 | Supporting Grants/Contract ID | Agency
CP5 | Publication Citation | ID, ref_biomaterial
CP6 | Sample Provider Principal Investigator (PI) Name | -
CP7 | Sample Provider PI's Institution | -
CP8 | Sample Provider PI's email | -
CP9 | Sequencing Facility | -
CP10 | Sequencing Facility Contact Name | -
CP11 | Sequencing Facility Contact's Institution | -
CP12 | Sequencing Facility Contact's email | -
CP13 | Bioinformatics Resource Center | -
CP14 | Bioinformatics Resource Center Contact Name | -
CP15 | Bioinformatics Resource Center Contact's Institution | -
CP16 | Bioinformatics Resource Center Contact's email | -
CP17 | Target Material | Material
CP18 | Project Method | Methodology
CP19 | Project Objectives | Objective
CP20 | Sample Scope | -
CP21 | Target Capture | Capture

10 Core Sample Metadata
(Slide columns: Field ID, Metadata Field Descriptor, OBO Foundry ID, NCBI BioSample, MIxS. No OBO Foundry IDs appear in the transcript; the NCBI BioSample and MIxS mappings are shown in one column because their boundary is not recoverable.)
Field ID | Metadata Field Descriptor | Mapped term (NCBI BioSample, MIxS)
CS1 | Specimen Source ID | -
CS2 | Specimen Source Species | -
CS3 | Species Source Common Name | host-common-name, host_common_name
CS4 | Specimen Source Gender | -
CS5 | Specimen Source Age - Value | -
CS6 | Specimen Source Age - Unit | -
CS7 | Specimen Source Health Status | status
CS8 | Specimen Collection Date | date
CS9 | Specimen Collection Location - Latitude | location (lat and long)
CS10 | Specimen Collection Location - Longitude | location (lat and long)
CS11 | Specimen Collection Location - Location | -
CS12 | Specimen Collection Location - Country | location (country and/or sea)
CS13 | Specimen ID | name
CS14 | Specimen Type | habitat, body site, body product
CS15 | Suspected Organism(s) in Specimen - Species | -
CS16 | Suspected Organism(s) in Specimen - Subclass | strain, subspecific genetic lineage
CS17 | Human Pathogenicity of Suspected Organism(s) in Specimen | phenotype
CS18 | Environmental Material | (material)
CS19 | Organism Detection Method | sample collection device or method
CS20 | Specimen Repository | culture-collection, source material identifiers
CS21 | Specimen Repository Sample ID | culture-collection, source material identifiers
CS22 | Sample ID - Sequencing Facility | -
CS23 | Nucleic Acid Extraction Method | material processing
CS24 | Nucleic Acid Preparation Method | samp_mat_process, sample material processing
CS25 | Sequencing Method | sequencing method
CS26 | Assembly Algorithm | assembly
CS27 | Depth of Coverage - Average | finishing strategy
CS28 | Annotation Algorithm | -
CS29 | GenBank Record ID | -
CS30 | Comments | -
CS31 | Specimen Collector Name | collected-by
CS32 | Specimen Collector's Institution | -
CS33 | Specimen Collector's email | -
CS34 | Sample Category | attribute_package
CS35 | Host Disease | host-disease
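A submission tool might check one spreadsheet row against a few of the core sample fields before accepting it. The sketch below is only an illustration under stated assumptions: the slides do not say which fields are mandatory or what formats are expected, so the `REQUIRED` list, the decimal-degree ranges and the ISO date format are all hypothetical. The dictionary keys reuse the field descriptors from the table.

```python
from datetime import date

# Assumed mandatory fields; the slides do not state which fields are required.
REQUIRED = (
    "Specimen ID",
    "Specimen Collection Date",
    "Specimen Collection Location - Country",
)

# Assumed coordinate format: decimal degrees within these absolute limits.
COORDINATE_LIMITS = {
    "Specimen Collection Location - Latitude": 90.0,
    "Specimen Collection Location - Longitude": 180.0,
}


def check_sample(record):
    """Return a list of problems found in one sample record (empty if none)."""
    problems = []

    for name in REQUIRED:
        if not str(record.get(name, "")).strip():
            problems.append(f"missing: {name}")

    for name, limit in COORDINATE_LIMITS.items():
        value = record.get(name)
        if value is None:
            continue  # coordinates are treated as optional here
        try:
            number = float(value)
        except (TypeError, ValueError):
            problems.append(f"not numeric: {name}")
            continue
        if abs(number) > limit:
            problems.append(f"out of range: {name}")

    raw_date = record.get("Specimen Collection Date")
    if raw_date:
        try:
            date.fromisoformat(str(raw_date))  # assumed format: YYYY-MM-DD
        except ValueError:
            problems.append("bad date: Specimen Collection Date")

    return problems
```

Returning a list of problems rather than raising on the first one lets a submitter fix a whole row in a single pass.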

11 Metadata Processes
[Diagram: OBI-based semantic network of the sequencing workflow.
Process groups: Investigation, Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Host Characterization, and a quality assessment assay.
Processes: specimen isolation process (with isolation protocol), sample processing, sequencing assay, data transformations (image processing, assembly; variant detection, serotype marker detection, gene detection), data archiving process.
Entities: specimen source (organism or environmental), specimen collector, specimen, microorganism enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, primary data, sequence data, genotype/serotype/gene data, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region.
Relations: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of.]
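A semantic network like the one on this slide can be held as subject-relation-object triples. The sketch below uses entity and relation names from the slide, but the specific wiring between them is an assumption: the diagram's edges cannot be recovered from the transcript, so `TRIPLES` is one plausible reading, and `trace_back` and `denoted_by` are hypothetical helpers.

```python
# Assumed wiring of the workflow; names come from the slide, edges are a guess.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("sample processing", "has_input", "specimen"),
    ("sample processing", "has_output", "microorganism genomic NA"),
    ("sequencing assay", "has_input", "microorganism genomic NA"),
    ("sequencing assay", "has_output", "primary data"),
    ("data transformation", "has_input", "primary data"),
    ("data transformation", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
    ("GenBank ID", "denotes", "sequence data record"),
]


def denoted_by(identifier, triples=TRIPLES):
    """Return the entity an identifier denotes, or None if unknown."""
    for subject, relation, obj in triples:
        if subject == identifier and relation == "denotes":
            return obj
    return None


def trace_back(entity, triples=TRIPLES):
    """Follow has_output/has_input links backwards from an entity to its origin."""
    lineage = [entity]
    while True:
        current = lineage[-1]
        producers = [s for s, p, o in triples if p == "has_output" and o == current]
        if not producers:
            return lineage
        inputs = [o for s, p, o in triples if s == producers[0] and p == "has_input"]
        if not inputs or inputs[0] in lineage:  # stop at a dead end or a cycle
            return lineage
        lineage.append(inputs[0])
```

Under this wiring, `trace_back(denoted_by("GenBank ID"))` walks from the sequence data record back through sequence data, primary data, genomic NA and specimen to the specimen source.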

12 Specimen Isolation
[Diagram: semantic model of a specimen isolation procedure, annotated with core sample field IDs.
Entities: organism, organism part, environmental material, environment, equipment, person, specimen X, specimen isolation procedure X, isolation protocol, specimen type, specimen isolation procedure type, hypothesis, IRB/IACUC approval, pathogenic disposition, name, ID, affiliation.
Roles: specimen source role, specimen capture role, specimen collector role.
Location and time: temporal-spatial region, spatial region, temporal interval, GPS location, date/time, geographic location.
Qualities: gender (CS4), age (CS5/6), health status (CS7).
Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_affiliation, instance_of, is_about, has_authorization, has_quality, has disposition.
Field IDs shown: CS1, CS2/3, CS4, CS5/6, CS7, CS8, CS9/10, CS11/12, CS13, CS14, CS15/16, CS18.]

13 Core Project Semantics

14 Outcome of Metadata Standards WG
- Consistent metadata captured across GSCID
- Bottom up approach focuses standard on important features
- Support more standardized BRC interface development
- Harmonization with related stakeholders
  - Genomic Standards Consortium MIxS, OBO Foundry OBI and NCBI BioProject/BioSample
- Represented in the context of an extensible semantic framework

15 Utility of semantic representation
- Identified gaps in data field list (e.g. temporal components)
- Includes logical structure for other, project-specific, data fields (extensible)
- Identified gaps in ontology data standards (use case-driven standard development)
- Identified commonalities in data structures (reusable)
- Support for semantic queries and inferential analysis in future
- Ontology-based framework is extensible
  - Sequencing => "omics"
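One form of the inferential analysis mentioned above is following the `located_in` relation transitively, so that a query for a country also finds specimens recorded only at a site or region level. This is a minimal sketch under assumptions: the helper names and the example data in the usage note are hypothetical, and treating `located_in` as transitive is an assumption about the intended semantics.

```python
def enclosing_regions(entity, located_in):
    """Return every region that contains the entity, directly or indirectly.

    located_in is an iterable of (inner, outer) pairs and is assumed transitive.
    """
    pairs = list(located_in)
    found = set()
    frontier = [entity]
    while frontier:
        node = frontier.pop()
        for inner, outer in pairs:
            if inner == node and outer not in found:
                found.add(outer)
                frontier.append(outer)
    return found


def specimens_within(region, specimens, located_in):
    """Return the specimens whose inferred location includes the given region."""
    pairs = list(located_in)
    return sorted(s for s in specimens if region in enclosing_regions(s, pairs))
```

For example, given the hypothetical facts `("specimen-001", "site-A")`, `("site-A", "region-B")` and `("region-B", "country-C")`, a query for "country-C" returns "specimen-001" even though no record states that directly.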

16 Acknowledgements
Bruce Birren 2,b, Lauren Brinkac 1,a, Vincent Bruno 3,c, Elizabeth Caler 1,a, Ishwar Chandramouliswaran 1,a, Sinéad Chapman 2,b, Frank Collins 8,h, Christina Cuomo 2,b, Joana Carneiro Da Silva 3,c, Valentina Di Francesco 4, Vivien Dugan 1,a, Scott Emrich 8,h, Mark Eppinger 3,c, Michael Feldgarden 2,b, Claire Fraser 3,c, W. Florian Fricke 3,c, Maria Giovanni 4, Gloria Giraldo-Calderon 8,h, Omar S. Harb 5,g, Matt Henn 2,b, Erin Hine 3,c, Julie Dunning Hotopp 3,c, Jessica C. Kissinger 6,g, Eun Mi Lee 4, Punam Mathur 4, Garry Myers 3,c, Emmanuel Mongodin 3,c, Cheryl Murphy 2,b, Dan Neafsey 2,b, Karen Nelson 1,a, Ruchi Newman 2,b, William Nierman 1,a, Brett E. Pickett 1,d,e, Julia Puzak 4, David Rasko 3,c, David S. Roos 5,g, Lisa Sadzewicz 3,c, Richard H. Scheuermann 1,d,e, Lynn M. Schriml 3,c, Bruno Sobral 7,f, Tim Stockwell 1,a, Chris Stoeckert 5,g, Dan Sullivan 7,f, Luke Tallon 3,c, Herve Tettelin 3,c, Doyle V. Ward 2,b, David Wentworth 1,a, Owen White 3,c, Rebecca Will 7,f, Jennifer Wortman 2,b, Alison Yao 4, Jie Zheng 5,g
Institutions:
1 J. Craig Venter Institute, Rockville, MD and San Diego, CA
2 Broad Institute, Cambridge, MA
3 Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD
4 National Institute of Allergy and Infectious Diseases, Rockville, MD
5 University of Pennsylvania, Philadelphia, PA
6 University of Georgia, Athens, GA
7 Cyberinfrastructure Division, Virginia Bioinformatics Institute, Blacksburg, VA
8 University of Notre Dame, South Bend, IN
Programs:
a J. Craig Venter Institute Genome Sequencing Center for Infectious Diseases
b Broad Institute Genome Sequencing Center for Infectious Diseases
c Institute for Genome Sciences Genome Sequencing Center for Infectious Diseases
d Influenza Research Database Bioinformatics Resource Center
e Virus Pathogen Resource Bioinformatics Resource Center
f PATRIC Bioinformatics Resource Center
g EuPathDB Bioinformatics Resource Center
h VectorBase Bioinformatics Resource Center
Tanya Barrett (NCBI)
Pelin Yilmaz (Genomic Standards Consortium)
N01AI2008038 /N01AI40041
