Data Management in the DOE Genomics:GTL Program Janet Jacobsen and Adam Arkin Lawrence Berkeley National Laboratory University of California, Berkeley.

Presentation transcript:


Topics (talk or handout)
  Basic facts about the Genomics:GTL Program
  Goals of the GTL Program
  Experimental data generated by GTL
  Laboratory methods
  Data management challenges, requirements, and needs
  Survey on Data Standards, Data Sharing, and Data Management (if time)
  Overall Recommendations
Lawrence Berkeley National Laboratory · University of California

Genomics:GTL Program
  Genomes to Life renamed Genomics:GTL
  One of three DOE genome programs
  First funding awards in July 2002
  Plan to fund and develop four user facilities
    – Production and Characterization of Proteins
    – Whole Proteome Analysis
    – Characterization and Imaging of Molecular Machines
    – Analysis and Modeling of Cellular Systems

Goals of the GTL Program
  Microbes are ubiquitous and have adapted to practically every environmental niche on earth. Some live and thrive in conditions generally thought to be inhospitable to life.
  GTL plans to study microbes and microbial communities that may be helpful in energy generation, environmental cleanup, and carbon sequestration.

Categories of Experimental Data
  Biomass production
  Genomic
    – sequence and annotate the microbe’s genome
  Transcriptomic
    – study transcription under different conditions
  Proteomic
    – what proteins are present and at what levels
  Metabolomic
    – what metabolites are present
  and others…

Laboratory Methods
  Biomass production
    – cell culture
  Transcriptomic (HTP, i.e., high-throughput)
    – microarrays
  Proteomic (HTP)
    – 2D gels, mass spectrometry
  Metabolomic (~HTP)
    – mass spectrometry, NMR

Data Volume and Complexity
  Example: mass spectrometry
    – mass spec used to identify proteins
    – raw data analyzed to get peak list
    – peak list used to identify peptides
    – database search to identify proteins from peptides
  Volume:
    – size of raw data set per experiment ~ 10 GB
    – multiple experiments per __/per organization
    – use FedEx to ship disk drives
  Complexity:
    – see PEDRo UML class diagram on next slide
  [Diagram labels: raw data, peak list, peptides, proteins]
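The four-step mass spectrometry workflow above can be sketched as a provenance chain in which each derived product records the product it came from. This is a minimal illustration only: the stage names are taken from the slide, while the class `DataProduct`, its fields, and the `run-001` labels are hypothetical and not part of any GTL or PEDRo schema.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stage names come from the slide; class and field names are hypothetical.
STAGES = ("raw data", "peak list", "peptides", "proteins")


@dataclass(frozen=True)
class DataProduct:
    stage: str
    label: str
    derived_from: Optional["DataProduct"] = None

    def __post_init__(self) -> None:
        # Enforce that each product is derived from the preceding stage.
        if self.stage not in STAGES:
            raise ValueError(f"unknown stage: {self.stage!r}")
        position = STAGES.index(self.stage)
        if position == 0:
            if self.derived_from is not None:
                raise ValueError("raw data cannot have a parent")
        elif (self.derived_from is None
              or self.derived_from.stage != STAGES[position - 1]):
            raise ValueError(
                f"{self.stage!r} must be derived from {STAGES[position - 1]!r}"
            )

    def lineage(self) -> List[str]:
        """Return the stage names from the raw data down to this product."""
        chain = []
        node: Optional[DataProduct] = self
        while node is not None:
            chain.append(node.stage)
            node = node.derived_from
        return chain[::-1]


# Hypothetical example run.
raw = DataProduct("raw data", "run-001")
peaks = DataProduct("peak list", "run-001 peaks", raw)
peptides = DataProduct("peptides", "run-001 peptides", peaks)
proteins = DataProduct("proteins", "run-001 proteins", peptides)
```

Calling `proteins.lineage()` walks back through the chain, which is one simple way to keep a protein identification traceable to the raw data set it came from.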

[Figure: PEDRo UML class diagram]

Data Management Challenges
  1. INTEGRATING DATA FROM DIVERSE SOURCES IS THE KEY TO GTL’S SUCCESS
     diverse = different laboratory methods, different organizations, different aspects of cellular functions/pathways
  2. CAPTURING METADATA IS VERY IMPORTANT
  3. In the future, we must be able to process LARGE numbers of LARGE data sets
  Item 3 is important, but not as important as items 1 and 2. We have to address those first.

Why Is Data Integration So Important to the GTL Program?
  Experimental data will be used to build models of cellular pathways, i.e., what goes on inside the cell.
  Different types of data contribute to building different aspects of the model (response to environmental conditions, growth phases, etc.).
  Think of building a pathway as an inverse problem.
  In addition, experimental data are used to verify models.

Why Are Metadata So Important to the GTL Program?
  We need to capture not only sample treatment (e.g., heat shock, oxygen stress), but all of the conditions under which an experimental analysis was performed. Otherwise we cannot compare the results from different experiments.
  We want to investigate how the same organism responds to different conditions, and how different organisms respond to the same condition.
  We also want to capture uncertainty.
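As a rough illustration of why the analysis conditions have to travel with the data, the sketch below stores treatment, conditions, and uncertainty alongside each result and only declares two results comparable when their recorded conditions match. The record layout, field names, and example values are assumptions made for this sketch, not a GTL metadata standard.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ExperimentRecord:
    # All field names are hypothetical.
    organism: str
    treatment: str               # e.g. "heat shock", "oxygen stress"
    conditions: Dict[str, str]   # conditions under which the analysis was run
    value: float                 # measured quantity
    uncertainty: float           # uncertainty attached to the measurement


def comparison_type(a: ExperimentRecord, b: ExperimentRecord) -> str:
    """Classify what, if anything, can be learned by comparing two records."""
    if a.conditions != b.conditions:
        # Without matching analysis conditions the results cannot be compared.
        return "not comparable"
    same_organism = a.organism == b.organism
    same_treatment = a.treatment == b.treatment
    if same_organism and same_treatment:
        return "replicate"
    if same_organism:
        return "same organism, different treatments"
    if same_treatment:
        return "different organisms, same treatment"
    return "not comparable"


# Hypothetical records.
settings = {"instrument": "MS-1", "growth phase": "exponential"}
a = ExperimentRecord("organism A", "heat shock", dict(settings), 2.0, 0.3)
b = ExperimentRecord("organism A", "oxygen stress", dict(settings), 1.1, 0.2)
c = ExperimentRecord("organism B", "heat shock", dict(settings), 1.7, 0.4)
d = ExperimentRecord("organism A", "oxygen stress", {"instrument": "MS-2"}, 1.0, 0.2)
```

Here `comparison_type(a, b)` and `comparison_type(a, c)` correspond to the two kinds of investigation named on the slide, while `comparison_type(a, d)` shows a comparison blocked by differing analysis conditions.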

Other Data Management Needs
  All of the usual ones…
    – secure access
    – storage of large volumes of data
    – data archives
    – data provenance
  plus one wrinkle… “staging of data access and management”.

Staging of Data Access/Management
  Stage 1: data are collected and undergo QA/QC within the lab producing the data – manage data locally.
  Stage 2: data are shared with other project collaborators – transport data and/or provide restricted access.
  Stage 3: data are published and move into the public domain – provide community-wide access to data.
  Stage 4: data are archived – need to provide safe storage from which the data can be retrieved.
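The four stages can be read as a simple one-way progression with widening access. The sketch below is one possible encoding, not a description of any GTL system: the role names are hypothetical, and the assumption that archived data stay readable by the same audiences as published data is ours, since the slide only states the storage requirement for Stage 4.

```python
from enum import IntEnum


class Stage(IntEnum):
    LOCAL = 1      # collected and QA/QC'd within the producing lab
    SHARED = 2     # shared with project collaborators
    PUBLISHED = 3  # in the public domain
    ARCHIVED = 4   # in safe long-term storage


# Who may read the data at each stage. Role names are hypothetical, and the
# ARCHIVED entry is an assumption (the slide does not specify archive access).
READERS = {
    Stage.LOCAL: {"producing lab"},
    Stage.SHARED: {"producing lab", "collaborator"},
    Stage.PUBLISHED: {"producing lab", "collaborator", "public"},
    Stage.ARCHIVED: {"producing lab", "collaborator", "public"},
}


def can_read(role: str, stage: Stage) -> bool:
    """Return True if the given role may read data at the given stage."""
    return role in READERS[stage]


def advance(stage: Stage) -> Stage:
    """Move a data set to the next stage; archived data cannot advance."""
    if stage is Stage.ARCHIVED:
        raise ValueError("archived data are already at the final stage")
    return Stage(stage + 1)
```

For example, `can_read("collaborator", Stage.LOCAL)` is false, and it becomes true once the data set has been moved on with `advance(Stage.LOCAL)`.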

Survey on Data Standards, Data Sharing, and Data Management
  Follow-up to work by the GTL Data Standards Working Group
  Link to survey mailed to registrants for GTL Program Workshop
  50+ respondents, mostly experimental biologists
    – 26 from national labs, 16 from universities, 8 from other organizations
  See handout for summary of survey results

Survey Results
  Most common data ‘format’ (78%): spreadsheet
  Most common measurement type (70%): image
  Few respondents are using any data standard. FCS (Flow Cytometry Standard), which is a file format, is the only data standard that received a high rating.
  About 20% of the respondents expressed a willingness to participate in developing or implementing data standards for GTL.

Recommendations from the Survey
  Checklist of required information about experiments, experimental conditions, and data
  Data standards, data formats, file formats
  Software tools/Web interfaces for
    – data entry, including metadata and experiment details
    – data uploading, query, and access
  Data organization to relate information on sample origin to experimental data on the sample
  DBMS with software to enter data
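A checklist of required information lends itself to an automated check at data-entry time. The following is a small sketch under stated assumptions: the field names in `REQUIRED_FIELDS` are invented for illustration, since the survey recommends a checklist but does not define its contents.

```python
from typing import Iterable, List, Mapping

# Hypothetical checklist; the survey recommends one but does not specify it.
REQUIRED_FIELDS = (
    "organism",
    "sample_origin",
    "treatment",
    "instrument",
    "operator",
    "date",
)


def missing_fields(
    record: Mapping[str, object],
    required: Iterable[str] = REQUIRED_FIELDS,
) -> List[str]:
    """Return the required fields that are absent or blank in a record."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or str(value).strip() == "":
            missing.append(name)
    return missing


# Hypothetical submission with two gaps: a blank operator and no date.
submission = {
    "organism": "organism A",
    "sample_origin": "site 1",
    "treatment": "heat shock",
    "instrument": "MS-1",
    "operator": "   ",
}
```

A data-entry tool could refuse an upload, or prompt the user, whenever `missing_fields(submission)` is non-empty.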

Comments from the Survey
  “It will help me a lot if someone will offer a short seminar on data standards.”
  Data standards are “of more interest to computer scientists than [to] biological scientists.”
  “This is all Greek to me which is exactly why very little to nothing is being developed that is useful to biologists like me.”

Difficulties in GTL Data Management
  Heterogeneous data.
  Metadata.
  Uncertainty.
  Lack of data standards. (Love/hate relationship.)
  Variety of DBMS being used.
  Variety of instrument output formats.
  Different DM phases with respect to data generation, analyses, and publication.
  Human factors: lab notebook to electronic format (potential loss of information), data rearrangement in spreadsheets.
  Data attribution.

Overall Recommendations
  GTL Program:
    – Establish data standards and facilitate implementation. Data standards MUST be compatible with formats required by journals.
    – Establish a project-wide schema for organism/gene-based database(s) to facilitate integration.
    – Address the data conversion problem.
  DOE:
    – Require a description of the data management plan as part of each proposal. (Currently being done?)
    – Investigate digital notepad technology?
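To make the idea of a project-wide organism/gene-based schema concrete, here is a deliberately small sketch using Python's built-in `sqlite3` module. The table and column names, the locus tag `GENE_0001`, and the example rows are all hypothetical; the talk recommends such a schema but does not propose one. The point of the sketch is only that keying every measurement to a shared gene record lets results from different labs and data types be retrieved together.

```python
import sqlite3
from typing import List

# Hypothetical schema: every measurement, whatever its type or source lab,
# points at a shared gene record, which points at a shared organism record.
SCHEMA = """
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE gene (
    gene_id     INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    locus_tag   TEXT NOT NULL,
    UNIQUE (organism_id, locus_tag)
);
CREATE TABLE measurement (
    measurement_id INTEGER PRIMARY KEY,
    gene_id        INTEGER NOT NULL REFERENCES gene(gene_id),
    data_type      TEXT NOT NULL,   -- e.g. transcriptomic, proteomic
    source_lab     TEXT NOT NULL,
    value          REAL NOT NULL
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO organism VALUES (1, 'organism A')")
    conn.execute("INSERT INTO gene VALUES (1, 1, 'GENE_0001')")
    conn.executemany(
        "INSERT INTO measurement (gene_id, data_type, source_lab, value) "
        "VALUES (?, ?, ?, ?)",
        [(1, "transcriptomic", "lab 1", 2.5), (1, "proteomic", "lab 2", 1.8)],
    )
    return conn


def data_types_for_gene(conn: sqlite3.Connection, locus_tag: str) -> List[str]:
    """List the data types available for one gene, across all source labs."""
    rows = conn.execute(
        "SELECT m.data_type FROM measurement m "
        "JOIN gene g ON g.gene_id = m.gene_id "
        "WHERE g.locus_tag = ? ORDER BY m.data_type",
        (locus_tag,),
    ).fetchall()
    return [row[0] for row in rows]
```

With the demo rows, `data_types_for_gene(build_demo_db(), "GENE_0001")` returns both the proteomic and the transcriptomic entry even though they came from different labs.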

Acknowledgements
  Carol Giometti, Argonne National Lab
  Frank Olken, Lawrence Berkeley National Laboratory
  Nancy Slater, GTL Project Manager, Lawrence Berkeley National Laboratory