SESSION CHAIR: RICHARD SCHEUERMANN (VIPR & IRD) BRC2011 Session #5 – Data Standards and Metadata.

Session #5 - Outline
- Motivation
- Opportunities, Challenges and Talking Points
    - minimum information checklists
    - ontology-based value sets
    - use cases for metadata
    - SOPs for data & metadata acquisition
- Ontology for Biomedical Investigations – Bjoern Peters
- Infectious Disease Ontology and extensions – Lindsay Cowell
- GSCID-BRC Metadata Working Group efforts
- Open discussion

Why Data Standards
- Interoperability - the ability to exchange information between people, organizations, machines
- Comparability - the ability to ascertain the equivalence of data from different sources
- Data Quality - the ability to assess the completeness, accuracy and precision of the data
- Dependability - the assurance that you get what you expect from a database query
- Accurate Statistical Analysis
- Inference

What Data Standards
- Minimum Information Sets – what needs to be described
- Structured Vocabulary/Ontology – how to describe them
    - Term strings – unique identifiers
    - Definitions - what terms mean
    - Syntax - how terms are used
- Semantics - how the components relate to each other

Session #5 – Challenges
- Status of relevant data standards
    - Few data standards have been widely adopted by the infectious diseases community
    - Some standards are being developed without engagement of all relevant stakeholders
    - If we drive standards development, how do we get broad adoption?
- Adoption of data standards by data providers
    - Even if vocabulary standards are available, how do we get the broader community to use them?
    - How do we educate them to use the data standards accurately?
    - How do we keep the barrier low for getting required metadata in a standard format?
- Technical challenges
    - Usability is constrained by the spreadsheet interface
    - Ontology-based controlled vocabularies are sometimes too large for a spreadsheet-like interface or drop-down lists
    - While web-based GUI smart forms are good for single submissions, they are difficult to design to scale
- Need for quality control and curation
    - If data standards are not enforced, mapping to standards may be required
    - Problems with homonyms (Turkey vs turkey) and synonyms (Puerto Rico and PR)
    - Not all tasks in metadata collection lend themselves to automation
    - Data entry quality control mechanisms are especially limited because of spreadsheet functionality
    - Could be 1-2 FTEs; not budgeted
- Compliance with HIPAA and other privacy regulations
    - PATRIC does not anticipate working with identifying data, but GSCIDs and investigators could be delayed by compliance issues
- Special cases
    - Metadata for genomes for NCBI bulk submission and non-unique taxon IDs
    - Metadata for growth conditions to be used with transcript datasets
    - Metadata for metagenomes to correlate genomes and proteins with useful info about sites and conditions
- How do we effectively exploit standardized data and metadata?
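
The homonym and synonym problem listed above (Turkey vs turkey, Puerto Rico and PR) can be pictured with a minimal sketch. It assumes a hypothetical lookup table scoped by metadata field, so the same string can resolve differently depending on where it was entered. The field names, the table contents, and the `normalize` helper are illustrative only and are not taken from any BRC or GSCID system.

```python
# Hypothetical synonym table, keyed first by metadata field and then by the
# lower-cased raw string. Scoping by field keeps homonyms apart.
SYNONYMS = {
    "geographic location": {
        "puerto rico": "Puerto Rico",
        "pr": "Puerto Rico",
        "turkey": "Turkey",
    },
    "host": {
        "turkey": "Meleagris gallopavo",
        "meleagris gallopavo": "Meleagris gallopavo",
    },
}


def normalize(field, raw):
    """Return the preferred term for `raw` within `field`, or None if unmapped."""
    table = SYNONYMS.get(field, {})
    return table.get(raw.strip().lower())


if __name__ == "__main__":
    print(normalize("geographic location", "PR"))
    print(normalize("geographic location", "Turkey"))
    print(normalize("host", "turkey"))
    # Unmapped values come back as None and would be queued for a curator.
    print(normalize("host", "PR"))
```

Returning None for unmapped values, rather than guessing, reflects the point that not every step of metadata collection lends itself to automation.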

Session #5 – Opportunities
- Existing relevant ontologies are in decent shape – GO, IDO, OBI
- Ontology for Biomedical Investigations (OBI) can provide a common framework for describing and exchanging datasets
- GSCID-BRC Metadata Working Group
- Leverage and harmonize with MIGS/MIMS
- We have the opportunity to establish policies for metadata collection, exchange, and release that would be broadly applicable
- We are in the position to drive standards adoption
- The BRCs support many pathogens that infect the same host(s) … can we exploit this fact to create specialized views and tools for interacting with the host resources from both pathogen and host perspectives?
- Ontology-driven integration (GMOD, Population biology)
- Small sequencing centers
    - Offer community a standard metadata template for isolates
    - Bring your own data and metadata to PATRIC for annotation, analysis, long term metadata storage and dissemination
- Develop additional metadata standards and collect, store, and share additional metadata
- More efficient encoding of things like alignments

Presentations
- Ontology for Biomedical Investigations (OBI) – Bjoern Peters
- Infectious Disease Ontology (IDO) and extensions – Lindsay Cowell
- GSCID-BRC Metadata Working Group

Working group established to define a common metadata standard for pathogen isolate sequencing projects
Collaboration between BRCs, GSCIDs and NIAID
Process
- Collect spreadsheets, metadata examples, and previous submissions from sequencing projects
- Core metadata fields collected by virus, bacteria and eukaryote subgroups
- For each metadata field, propose:
    - preferred term
    - definition
    - synonyms
    - allowed values based on controlled vocabularies
    - preferred syntax
    - responsible provider
    - data category
    - examples
- Merge recommendations from subgroups into a common core metadata set using an OBI-based semantic framework
- Develop recommendations for project-specific and pathogen-specific metadata fields
- Harmonize with other relevant standards (MIGS/MIMS, IDO)
- Establish policies and procedures for metadata submission workflows and GenBank linkage
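
The per-field proposal described above (preferred term, definition, synonyms, allowed values, preferred syntax, responsible provider, data category, examples) maps naturally onto a small record type. The following is a sketch under stated assumptions: the `MetadataField` class, its `is_valid` check, and the two example fields are hypothetical illustrations, not the working group's actual definitions.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataField:
    """One proposed metadata field, with the attributes listed on the slide."""
    preferred_term: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[List[str]] = None  # controlled vocabulary; None = free text
    preferred_syntax: Optional[str] = None      # regular expression; None = unconstrained
    responsible_provider: str = ""
    data_category: str = ""
    examples: List[str] = field(default_factory=list)

    def is_valid(self, value: str) -> bool:
        """Check a submitted value against the vocabulary and syntax, if any."""
        if self.allowed_values is not None and value not in self.allowed_values:
            return False
        if self.preferred_syntax is not None:
            return re.fullmatch(self.preferred_syntax, value) is not None
        return True


# Illustrative entries only (assumed names, values, and syntax).
collection_date = MetadataField(
    preferred_term="specimen collection date",
    definition="Date on which the specimen was collected.",
    synonyms=["collection date"],
    preferred_syntax=r"\d{4}-\d{2}-\d{2}",
    responsible_provider="specimen collector",
    data_category="Specimen Isolation",
    examples=["2011-09-15"],
)

specimen_type = MetadataField(
    preferred_term="specimen type",
    definition="Kind of material collected from the specimen source.",
    allowed_values=["nasal swab", "blood", "environmental water"],
    data_category="Specimen Isolation",
    examples=["nasal swab"],
)
```

Keeping the examples inside the record makes it possible to check each proposed field against its own examples before the subgroups' recommendations are merged.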

Core Metadata Examples

Network Overview (diagram)
Entities: specimen source (organism or environmental); specimen collector; specimen (type, ID, qualities); microorganism; microorganism genomic NA; enriched NA sample; input sample; reagents; technician; equipment; temporal-spatial region; GenBank ID
Processes: specimen isolation process (isolation protocol); sample processing; sequencing assay; data transformations (image processing, assembly); data transformations (variant detection, serotype marker detection, gene detection); data archiving process
Data: primary data; sequence data; genotype/serotype/gene data; sequence data record
Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of
Legend: independent continuant; dependent continuant; occurrent; temporal-spatial region; relations in italics

Network Overview, grouped by category (diagram)
Same network as the Network Overview slide, partitioned into five regions: Specimen Isolation, Material Processing, Sequencing Assay, Data Processing, Investigation
Entities and processes shown: specimen source (organism or environmental); specimen collector; specimen; microorganism; microorganism genomic NA; enriched NA sample; specimen isolation process (isolation protocol); sample processing; sequencing assay (input sample, reagents, technician, equipment); data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection); primary data; sequence data; genotype/serotype/gene data; data archiving process; sequence data record; GenBank ID
Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of

Detailed network with virus sheet rows (diagram)
Material entities: organism; environmental material; equipment; person; sample material; specimen; microorganism; microorganism genomic NA; enriched NA sample; cDNA sample
Roles: specimen source role; specimen capture role; specimen collector role; template role; reagent role; sequencing tech. role; signal detection role
Processes and specifications: specimen isolation process (isolation protocol); NA enrichment process (NA enrichment protocol); cDNA synthesis process (cDNA synthesis protocol); sequencing assay (sequencing protocol); data transformations (image processing, assembly; variant detection, serotype marker detection, gene detection) with algorithm and software; data archiving process (data transfer protocol)
Data and identifiers: primary data; sequence data; genotype/serotype/gene data; sequence data record; GenBank ID; organism ID; ID; name; common name; species/strain; affiliation; amount; age, gender, symptom
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_affiliation, is_about, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram (vX = row X in virus sheet): v2, v3-4, v5-6, v7, v8, v10, v11, v12, v13, v15, v16, v22, v23, v24, v25, v27, v29, v30, v31, v32, v40, v42, v43, v44, v45, v46
Legend: independent continuant; dependent continuant; occurrent; temporal-spatial region; relations in italics

Metadata Categories
- Investigation
- Specimen Isolation
- Specimen Processing
- Sample Shipment
- Pathogen Detection & Isolation
- Sequencing Sample Preparation
- Sequencing Assay
- Data Transformation

Specimen Isolation (diagram)
Material entities: organism; organism part; environmental material; equipment; person; specimen X; microorganism
Roles: specimen source role; specimen capture role; specimen collector role
Process and specification: specimen isolation procedure X (isolation protocol); specimen isolation procedure type
Attributes and identifiers: species/strain; organism ID; common name; name; affiliation; ID; specimen type; age, gender, symptom
Other nodes: hypothesis; IRB/IACUC approval; Comments ????
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, has_quality, has_affiliation, has_authorization, is_about, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram (vX = row X in virus sheet): v2, v3-4, v5-6, v7, v8, v9, v10, v11, v12, v13, v15, v16, v17, v18, v19, v27
Legend: independent continuant; dependent continuant; occurrent; temporal-spatial region; relations in italics

Specimen Processing (diagram)
Material entities: specimen X; microorganism X; specimen X aliquot Y; specimen A aliquot B; specimen M aliquot N; specimen T aliquot U; sample set X
Processes and specifications: aliquoting process X (aliquoting protocol); sample set assembly process X (sample set assembly protocol)
Attributes and identifiers: ID; specimen type; amount; species/strain
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram: v15, v16, v20, v22, v23, v24, v27

Sample Shipment (diagram)
Material entities: sample set X; sample set X in transit; sample set X at GSC; sample X at GSC
Processes and specifications: sample shipment process X (sample shipment protocol); sample receipt process X (sample receipt protocol)
Attributes and identifiers: ID; sample set type; sample type; amount
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram: v21, v23, v24, v25

Pathogen Detection & Isolation (diagram)
Material entities: specimen X; microorganism X; pathogen isolate X
Processes and specifications: pathogen detection process X (pathogen detection method, pathogen detection protocol); pathogen isolation process X (pathogen isolation method, pathogen detection protocol)
Data: data about pathogen presence
Attributes and identifiers: ID; specimen type; pathogen type; amount; species/strain
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram: v15, v16, v26, v27, v28, v34

Sequencing Sample Preparation (diagram)
Material entities: specimen X; specimen aliquot X; microorganism X; microorganism genomic NA; enriched NA sample X; cDNA sample X
Processes and specifications: aliquoting process X (aliquoting protocol); NA enrichment process X (NA enrichment protocol); cDNA synthesis process X (cDNA synthesis protocol)
Attributes and identifiers: ID; specimen type; amount; species/strain
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, has_specification, has_part, has_quality, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram: v15, v16, v27, v33, v35, v36, v37, v38, v39

Sequencing Assay (diagram)
Process and specification: sequencing assay X (sequencing protocol); sequencing assay type; run ID; objectives – coverage, genome type targeted
Inputs and their roles: sample material X plays template role (sample ID, sample type); lot # plays reagent role (reagent type); person X plays sequencing tech. role (name, species); equipment X plays signal detection role (serial #, equipment type)
Output: primary data
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram: v14, v40, v41

Data Transformations (diagram)
Processes: data transformations – image processing, assembly X; data transformations – variant detection; data transformations – serotype marker detection; data transformations – gene detection; data archiving process
Specifications: algorithm; software; data transfer protocol
Data: primary data; sequence data; genotype data; serotype data; gene data; sequence data record; GenBank ID; run ID
Other entities: microorganism X; microorganism genomic NA; species/strain; person X plays bioinformatics tech. role (name, species)
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, part_of, is_about, denotes, located_in, instance_of
Virus sheet rows annotated on the diagram: v29, v30, v31, v32, v42, v43, v44, v45, v46, v47

Investigation (diagram)
Nodes: investigation; study design; documenting; study design execution; objective specification; data transformation; information content entity; specimen creation; specimen preparation for assay; sequencing assay
Relations: has_part, has_specified_input
Legend: independent continuant; dependent continuant; occurrent; temporal-spatial region; relations in italics

Generic Assay (diagram)
Process and specification: assay X (assay protocol); assay type; run ID; objectives
Inputs and their roles: sample material X plays target role (sample ID, sample type); lot # plays reagent role (reagent type); person X plays technician role (name, species); equipment X plays signal detection role (serial #, equipment type)
Output: primary data, which is_about input sample material X
Other nodes: analyte X; quality x
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, has_quality, is_about, denotes, located_in, instance_of

Generic Material Transformation (diagram)
Process and specification: material transformation X (material transformation protocol); material transformation type; run ID; objectives
Inputs and their roles: sample material X plays target role (sample ID, sample type, quality x); lot # plays reagent role (reagent type); person X plays technician role (name, species); equipment X plays signal detection role (serial #, equipment type)
Output: output material X (sample ID, material type, quality x)
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, has_quality, denotes, located_in, instance_of

Generic Data Transformation (diagram)
Process and specification: data transformation X (algorithm, software); data transformation type; run ID
Input and output: input data; output data, which is_about material X
Person: person X plays data analyst role (name)
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_input, has_output, plays, has_specification, has_part, is_about, denotes, located_in, instance_of

Generic Material (IC) (diagram)
Nodes: material X (ID, material type, quality x); material Y and material Z as parts of material X; quality y
Location and time: temporal-spatial region; spatial region; temporal interval; GPS location; geographic location; date/time
Relations: has_part, has_quality, denotes, located_in, instance_of

Discussion Points
- MIBBI may not be sufficient
    - Don't distinguish between the minimum information to reproduce an experiment and the minimum information to structure in a database
    - Lack a semantic framework
- OBI-based framework is re-usable
    - Sequencing => "omics"
- Challenge of using ontologies for preferred value sets
    - Can be large
    - May not directly match common language
- Value of defining the semantic framework
    - Appropriate relations are retained
    - How can we take advantage of the framework for semantic query and inferential analysis?
- Practical issues about implementation strategies
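
One way to picture the semantic query question raised above is to store the network as subject-relation-object triples and follow has_output and has_input links backwards from a data record to the material it came from. This is only a sketch: the specific edges below are an assumed reading of the overview diagram, and the `upstream` helper is hypothetical rather than part of OBI or any BRC tool.

```python
# Assumed edges, loosely following the overview network: each process
# has_input some entity and has_output another.
TRIPLES = [
    ("specimen isolation process", "has_input", "specimen source"),
    ("specimen isolation process", "has_output", "specimen"),
    ("NA enrichment process", "has_input", "specimen"),
    ("NA enrichment process", "has_output", "enriched NA sample"),
    ("sequencing assay", "has_input", "enriched NA sample"),
    ("sequencing assay", "has_output", "primary data"),
    ("assembly", "has_input", "primary data"),
    ("assembly", "has_output", "sequence data"),
    ("data archiving process", "has_input", "sequence data"),
    ("data archiving process", "has_output", "sequence data record"),
]


def upstream(entity, triples=TRIPLES):
    """Return everything `entity` was derived from, nearest ancestor first."""
    found = []
    frontier = [entity]
    while frontier:
        current = frontier.pop(0)
        # Processes that produced the current entity.
        producers = [s for s, p, o in triples if p == "has_output" and o == current]
        for process in producers:
            # Inputs of those processes are one step further upstream.
            for s, p, o in triples:
                if s == process and p == "has_input" and o not in found:
                    found.append(o)
                    frontier.append(o)
    return found


if __name__ == "__main__":
    for ancestor in upstream("sequence data record"):
        print(ancestor)
```

Because the relations are retained explicitly, the same traversal would apply to any assay or transformation that follows the generic has_input and has_output pattern, not only sequencing.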