SESSION CHAIR: RICHARD SCHEUERMANN (VIPR & IRD) BRC2011 Session #5 – Data Standards and Metadata.

1 SESSION CHAIR: RICHARD SCHEUERMANN (VIPR & IRD) BRC2011 Session #5 – Data Standards and Metadata

2 Session #5 - Outline
- Motivation
- Opportunities, Challenges and Talking Points
  - minimum information checklists
  - ontology-based value sets
  - use cases for metadata
  - SOPs for data & metadata acquisition
- Ontology for Biomedical Investigations – Bjoern Peters
- Infectious Disease Ontology and extensions – Lindsay Cowell
- GSCID-BRC Metadata Working Group efforts
- Open discussion

3 Why Data Standards
- Interoperability – the ability to exchange information between people, organizations, machines
- Comparability – the ability to ascertain the equivalence of data from different sources
- Data Quality – assess the completeness, accuracy and precision of the data
- Dependability – ensures that you get what you expect from a database query
- Accurate Statistical Analysis Inference

4 What Data Standards
- Minimum Information Sets – what needs to be described
- Structured Vocabulary/Ontology – how to describe them
  - Term strings – unique identifiers
  - Definitions – what terms mean
  - Syntax – how terms are used
  - Semantics – how the components relate to each other
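
A minimum information set can be pictured as a checklist applied to each submitted record. The sketch below is only an illustration: the field names in `REQUIRED_FIELDS` and the function `missing_fields` are hypothetical and do not come from any published checklist.

```python
# Hypothetical minimum-information checklist; the field names are illustrative only.
REQUIRED_FIELDS = ["specimen ID", "collection date", "geographic location", "host species"]


def missing_fields(record, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or blank in the record."""
    return [name for name in required if not str(record.get(name, "")).strip()]


# A record with one field filled, one blank, and two absent.
record = {"specimen ID": "S1", "collection date": ""}
print(missing_fields(record))
```

A check like this says only whether something was described; how it is described is the job of the structured vocabulary.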

5 Session #5 – Challenges
- Status of relevant data standards
  - Few data standards have been widely adopted by the infectious diseases community
  - Some standards are being developed without engagement of all relevant stakeholders
  - If we drive standards development, how do we get broad adoption?
- Adoption of data standards by data providers
  - Even if vocabulary standards are available, how do we get the broader community to use them?
  - How do we educate them to use the data standards accurately?
  - How do we keep the barrier low for getting required metadata in a standard format?
- Technical challenges
  - Usability is constrained by the spreadsheet interface
  - Ontology-based controlled vocabularies are sometimes too large for a spreadsheet-like interface or drop-down lists
  - While web-based GUI smart forms are good for single submissions, they are difficult to design to scale
- Need for quality control and curation
  - If data standards are not enforced, mapping to standards may be required
  - Problems with homonyms (Turkey vs turkey) and synonyms (Puerto Rico and PR)
  - Not all tasks in metadata collection lend themselves to automation
  - Data entry quality control mechanisms are especially limited because of spreadsheet functionality
  - Could be 1-2 FTEs; not budgeted
- Compliance with HIPAA and other privacy regulations
  - PATRIC does not anticipate working with identifying data, but GSCIDs and investigators could be delayed by compliance issues
- Special cases
  - Metadata for genomes for NCBI bulk submission and non-unique taxon IDs
  - Metadata for growth conditions to be used with transcript datasets
  - Metadata for metagenomes to correlate genomes and proteins with useful info about sites and conditions
- How do we effectively exploit standardized data and metadata?
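
The homonym and synonym problem can be made concrete with a small sketch. Everything in it is hypothetical: the field names, the lookup tables, and the `normalize` function are illustrative assumptions, not part of any BRC or GSCID tool. The point is that values are resolved per field, so "Turkey" in a location field and "turkey" in a host field map to different terms, while "PR" and "Puerto Rico" map to the same one.

```python
# Hypothetical per-field synonym tables; the preferred terms are illustrative only.
SYNONYMS = {
    "location": {"puerto rico": "Puerto Rico", "pr": "Puerto Rico", "turkey": "Turkey"},
    "host": {"turkey": "Meleagris gallopavo"},
}


def normalize(field_name, raw_value):
    """Return the preferred term for raw_value in the given field, or None if unmapped."""
    table = SYNONYMS.get(field_name, {})
    # Case-fold and collapse whitespace so "  PR " and "pr" hit the same key.
    key = " ".join(raw_value.strip().lower().split())
    return table.get(key)


print(normalize("location", "PR"))
print(normalize("location", "Turkey"))
print(normalize("host", "turkey"))
```

Unmapped values come back as `None`, which is where the manual curation effort mentioned on the slide would begin.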

6

7 Session #5 – Opportunities
- Existing relevant ontologies are in decent shape – GO, IDO, OBI
- Ontology for Biomedical Investigations (OBI) can provide a common framework for describing and exchanging datasets
- GSCID-BRC Metadata Working Group
- Leverage and harmonize with MIGS/MIMS
- We have the opportunity to establish policies for metadata collection, exchange, and release that would be broadly applicable
- We are in the position to drive standards adoption
- The BRCs support many pathogens that infect the same host(s) … can we exploit this fact to create specialized views and tools for interacting with the host resources from both pathogen and host perspectives?
- Ontology-driven integration (GMOD, Population biology)
- Small sequencing centers
  - Offer community a standard metadata template for isolates
  - Bring your own data and metadata to PATRIC for annotation, analysis, long term metadata storage and dissemination
- Develop additional metadata standards and collect, store, and share additional metadata
- More efficient encoding of things like alignments

8 Presentations
- Ontology for Biomedical Investigations (OBI) – Bjoern Peters
- Infectious Disease Ontology (IDO) and extensions – Lindsay Cowell
- GSCID-BRC Metadata Working Group

9 Working group established to define common metadata standard for pathogen isolate sequencing projects
- Collaboration between BRCs, GSCIDs and NIAID
- Process
  - Collect spreadsheets, metadata examples, previous submission from sequencing projects
  - Core metadata fields collected by virus, bacteria and eukaryote subgroups
  - For each metadata field, propose:
    - preferred term
    - definition
    - synonyms
    - allowed values based on controlled vocabularies
    - preferred syntax
    - responsible provider
    - data category
    - examples
  - Merge recommendations from subgroups into a common core metadata using an OBI-based semantic framework
  - Develop recommendations for project-specific and pathogen-specific metadata fields
  - Harmonize with other relevant standards (MIGS/MIMS, IDO)
  - Establish policies and procedures for metadata submission workflows and GenBank linkage
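
The per-field proposal could be captured in a small record type, sketched below. The attribute names follow the list on the slide; the class name `MetadataField`, the `accepts` method, and the example entry are hypothetical and are not taken from the working group's spreadsheets.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetadataField:
    # Attributes mirror the items proposed for each metadata field on the slide.
    preferred_term: str
    definition: str
    synonyms: List[str] = field(default_factory=list)
    allowed_values: Optional[List[str]] = None  # None means free text is accepted
    preferred_syntax: str = ""
    responsible_provider: str = ""
    data_category: str = ""
    examples: List[str] = field(default_factory=list)

    def accepts(self, value: str) -> bool:
        """True if value is in the controlled vocabulary, or is non-blank free text."""
        if self.allowed_values is None:
            return bool(value.strip())
        return value in self.allowed_values


# Hypothetical entry; the values are illustrative only.
specimen_type = MetadataField(
    preferred_term="specimen type",
    definition="The kind of material collected from the specimen source.",
    synonyms=["sample type"],
    allowed_values=["blood", "nasal swab", "environmental sample"],
    responsible_provider="specimen collector",
    data_category="Specimen Isolation",
    examples=["nasal swab"],
)

print(specimen_type.accepts("nasal swab"))
```

Keeping the controlled vocabulary on the field itself means a submission form or spreadsheet checker can validate values without knowing anything else about the ontology behind them.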

10 Core Metadata Examples

11 Network Overview
[Diagram: semantic network for a sequencing project. Entities shown: specimen source (organism or environmental), specimen collector, specimen isolation process, isolation protocol, specimen, microorganism, sample processing, enriched NA sample, microorganism genomic NA, input sample, reagents, technician, equipment, sequencing assay, primary data, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID, type, ID, qualities, temporal-spatial region. Relations shown: has_input, has_output, has_specification, has_part, is_about, denotes, located_in, has_quality, instance_of. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations.]

12 [Diagram: the same network as slide 11, overlaid with the region labels Investigation, Specimen Isolation, Material Processing, Sequencing Assay, and Data Processing.]

13 [Diagram: detailed version of the network, annotated with virus-sheet rows (vX = row X in virus sheet). Entities shown: organism, environmental material, equipment, person, sample material, cDNA sample, specimen, microorganism, enriched NA sample, microorganism genomic NA, specimen isolation process and isolation protocol, NA enrichment process and protocol, cDNA synthesis process and protocol, sequencing assay and sequencing protocol, data transformations (image processing, assembly, variant detection, serotype marker detection, gene detection), algorithm, software, data transfer protocol, primary data, sequence data, genotype/serotype/gene data, data archiving process, sequence data record, GenBank ID. Roles: template role, reagent role, sequencing tech. role, signal detection role, specimen source role, specimen capture role, specimen collector role. Attributes: species/strain, organism ID, common name, name, affiliation, ID, amount, age, gender, symptom, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, plays, has_specification, has_part, located_in, denotes, is_about, has_quality, instance_of, has_affiliation. Rows referenced: v2, v3-4, v5-6, v7, v8, v10, v11, v12, v13, v15, v16, v22, v23, v24, v25, v27, v29, v30, v31, v32, v40, v42, v43, v44, v45, v46. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations.]

14 Metadata Categories
- Investigation
- Specimen Isolation
- Specimen Processing
- Sample Shipment
- Pathogen Detection & Isolation
- Sequencing Sample Preparation
- Sequencing Assay
- Data Transformation

15 Specimen Isolation
[Diagram. Entities shown: organism, environmental material, equipment, person, organism part, specimen X, microorganism, specimen isolation procedure X, isolation protocol, specimen type, specimen isolation procedure type, hypothesis, IRB/IACUC approval, Comments (marked "????"). Roles: specimen source role, specimen capture role, specimen collector role. Attributes: species/strain, organism ID, common name, name, affiliation, ID, age, gender, symptom, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, plays, has_specification, has_part, denotes, located_in, has_quality, instance_of, has_affiliation, is_about, has_authorization. Rows referenced (vX = row X in virus sheet): v2, v3-4, v5-6, v7, v8, v9, v10, v11, v12, v13, v15, v16, v17, v18, v19, v27. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations.]

16 Specimen Processing
[Diagram. Entities shown: specimen X, microorganism X, aliquoting process X, aliquoting protocol, specimen X aliquot Y (examples: specimen A aliquot B, specimen M aliquot N, specimen T aliquot U), sample set X, sample set assembly process X, sample set assembly protocol. Attributes: ID, specimen type, amount, species/strain, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_part, has_specification, located_in, denotes, instance_of, has_quality. Rows referenced: v15, v16, v20, v22, v23, v24, v27.]

17 Sample Shipment
[Diagram. Entities shown: sample set X, sample set X in transit, sample set X at GSC, sample X at GSC, sample shipment process X, sample shipment protocol, sample receipt process X, sample receipt protocol. Attributes: ID, sample set type, sample type, amount, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, denotes, instance_of, has_quality, located_in. Rows referenced: v21, v23, v24, v25.]

18 Pathogen Detection & Isolation
[Diagram. Entities shown: specimen X, microorganism X, pathogen detection process X, pathogen detection method, pathogen detection protocol, data about pathogen presence, pathogen isolation process X, pathogen isolation method, pathogen isolate X. Attributes: ID, specimen type, pathogen type, amount, species/strain, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, is_about, denotes, instance_of, has_quality, located_in. Rows referenced: v15, v16, v26, v27, v28, v34.]

19 Sequencing Sample Preparation
[Diagram. Entities shown: specimen X, specimen aliquot X, microorganism X, microorganism genomic NA, enriched NA sample X, cDNA sample X, aliquoting process X, aliquoting protocol, NA enrichment process X, NA enrichment protocol, cDNA synthesis process X, cDNA synthesis protocol. Attributes: ID, specimen type, amount, species/strain, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_part, has_specification, located_in, denotes, instance_of, has_quality. Rows referenced: v15, v16, v27, v33, v35, v36, v37, v38, v39.]

20 Sequencing Assay
[Diagram. Entities shown: sequencing assay X, sample material X, person X, equipment X, lot #, primary data, sequencing protocol, objectives (coverage, genome type targeted). Roles: reagent role, template role, sequencing tech. role, signal detection role. Attributes: run ID, sequencing assay type, reagent type, sample ID, sample type, name, species, serial #, equipment type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, plays, located_in, denotes, instance_of. Rows referenced: v14, v40, v41.]

21 Data Transformations
[Diagram. Entities shown: primary data, sequence data, genotype data, serotype data, gene data, data transformations (image processing, assembly X, variant detection, serotype marker detection, gene detection), algorithm, software, data transfer protocol, data archiving process, sequence data record, GenBank ID, microorganism X, microorganism genomic NA, person X. Roles: bioinformatics tech. role. Attributes: name, species, species/strain, run ID, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, part_of, is_about, denotes, instance_of, plays, located_in. Rows referenced: v29, v30, v31, v32, v42, v43, v44, v45, v46, v47.]

22 Investigation
[Diagram. Entities shown: investigation, study design, documenting, study design execution, objective specification, data transformation, information content entity, specimen creation, specimen preparation for assay, sequencing assay. Relations shown: has_part, has_specified_input. Legend: independent continuant, dependent continuant, occurrent, temporal-spatial region; italics mark relations.]

23 Generic Assay
[Diagram. Entities shown: assay X, input sample material X, sample material X, analyte X, person X, equipment X, lot #, primary data, assay protocol, objectives, quality x. Roles: reagent role, target role, technician role, signal detection role. Attributes: run ID, assay type, reagent type, sample ID, sample type, name, species, serial #, equipment type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, has_quality, is_about, plays, located_in, denotes, instance_of.]

24 Generic Material Transformation
[Diagram. Entities shown: material transformation X, sample material X, output material X, person X, equipment X, lot #, material transformation protocol, objectives, quality x. Roles: reagent role, target role, technician role, signal detection role. Attributes: run ID, material transformation type, reagent type, sample ID, sample type, material type, name, species, serial #, equipment type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, has_quality, plays, located_in, denotes, instance_of.]

25 Generic Data Transformation
[Diagram. Entities shown: data transformation X, input data, output data, material X, algorithm, software, person X. Roles: data analyst role. Attributes: name, run ID, data transformation type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_input, has_output, has_specification, has_part, is_about, plays, located_in, denotes, instance_of.]

26 Generic Material (IC)
[Diagram. Entities shown: material X, material Y, material Z, quality x, quality y. Attributes: ID, material type, temporal-spatial region (spatial region, temporal interval, GPS location, date/time, geographic location). Relations shown: has_part, has_quality, denotes, instance_of, located_in.]

27 Discussion Points
- MIBBI may not be sufficient
  - Doesn't distinguish between the minimum information to reproduce an experiment and the minimum information to structure in a database
  - Lacks a semantic framework
- OBI-based framework is re-usable
  - Sequencing => “omics”
- Challenge of using ontologies for preferred value sets
  - Can be large
  - May not directly match common language
- Value of defining the semantic framework
  - Appropriate relations are retained
  - How can we take advantage of the framework for semantic query and inferential analysis?
- Practical issues about implementation strategies
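
One way to picture a semantic query over the framework is as a walk across stored relations. The sketch below is an assumption-laden illustration: the triples are hypothetical instances that only borrow the relation names has_input and has_output from the slides, and the `upstream` function is not part of OBI or any BRC system.

```python
# Hypothetical (subject, relation, object) triples; only the relation names come from the slides.
TRIPLES = [
    ("specimen isolation process X", "has_output", "specimen X"),
    ("sequencing assay X", "has_input", "specimen X"),
    ("sequencing assay X", "has_output", "primary data X"),
    ("assembly X", "has_input", "primary data X"),
    ("assembly X", "has_output", "sequence data X"),
]


def upstream(entity, triples):
    """List everything the entity was derived from, by following has_output back
    to each producing process and has_input back to that process's inputs."""
    found = []
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        producers = [s for s, r, o in triples if r == "has_output" and o == current]
        for process in producers:
            for s, r, o in triples:
                if s == process and r == "has_input" and o not in found:
                    found.append(o)
                    frontier.append(o)
    return found


print(upstream("sequence data X", TRIPLES))
```

Because the relations are retained rather than flattened into spreadsheet columns, a question such as "which specimen does this sequence record trace back to" becomes a traversal instead of a manual lookup.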

