Data Management in the DOE Genomics:GTL Program Janet Jacobsen and Adam Arkin Lawrence Berkeley National Laboratory University of California, Berkeley.


1 Data Management in the DOE Genomics:GTL Program Janet Jacobsen and Adam Arkin Lawrence Berkeley National Laboratory University of California, Berkeley

2 Topics (talk or handout)
Basic facts about the Genomics:GTL Program
Goals of the GTL Program
Experimental data generated by GTL
Laboratory methods
Data management challenges, requirements, and needs
Survey on Data Standards, Data Sharing, and Data Management – if time
Overall Recommendations
Lawrence Berkeley National Laboratory, University of California

3 Genomics:GTL Program
Genomes to Life renamed Genomics:GTL
One of three DOE genome programs
First funding awards in July 2002
Plan to fund and develop four user facilities
  – Production and Characterization of Proteins
  – Whole Proteome Analysis
  – Characterization and Imaging of Molecular Machines
  – Analysis and Modeling of Cellular Systems

4 Goals of the GTL Program
Microbes are ubiquitous and have adapted to practically every environmental niche on earth.
Some live and thrive in conditions generally thought to be inhospitable to life.
GTL plans to study microbes and microbial communities that may be helpful in energy generation, environmental cleanup, and carbon sequestration.

5 Categories of Experimental Data
Biomass production
Genomic
  – sequence and annotate the microbe’s genome
Transcriptomic
  – study transcription under different conditions
Proteomic
  – what proteins are present and at what levels
Metabolomic
  – what metabolites are present
and others…

6 Laboratory Methods
Biomass production
  – cell culture
Transcriptomic (HTP)
  – microarrays
Proteomic (HTP)
  – 2D gels, mass spectrometry
Metabolomic (~HTP)
  – mass spectrometry, NMR

7 Data Volume and Complexity
Example: mass spectrometry
  – mass spec used to identify proteins
  – raw data analyzed to get peak list
  – peak list used to identify peptides
  – database search to identify proteins from peptides
Volume:
  – size of raw data set per experiment ~ 10 GB
  – multiple experiments per __/per organization
  – use FedEx to ship disk drives
Complexity:
  – see PEDRo UML class diagram on next slide
[Diagram: raw data → peak list → peptides → proteins]
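
The four-step mass spectrometry flow on this slide (raw data, peak list, peptides, proteins) can be sketched in a few lines of Python. This is only an illustrative toy, not the analysis software used by any GTL project: the function names, the local-maximum peak picker, the mass tolerance, and all of the data values are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # mass-to-charge ratio
    intensity: float  # detector signal at that m/z


def pick_peaks(spectrum, min_intensity):
    """Raw data -> peak list: keep local maxima at or above min_intensity."""
    peaks = []
    for i in range(1, len(spectrum) - 1):
        mz, intensity = spectrum[i]
        if (intensity >= min_intensity
                and intensity > spectrum[i - 1][1]
                and intensity > spectrum[i + 1][1]):
            peaks.append(Peak(mz, intensity))
    return peaks


def match_peptides(peaks, peptide_masses, tol):
    """Peak list -> peptides: a peptide is found if any peak lies within tol."""
    found = [name for name, mass in peptide_masses.items()
             if any(abs(p.mz - mass) <= tol for p in peaks)]
    return sorted(found)


def identify_proteins(peptides, protein_db):
    """Peptides -> proteins: count the matched peptides for each protein."""
    observed = set(peptides)
    hits = {}
    for protein, expected in protein_db.items():
        n = len(expected & observed)
        if n:
            hits[protein] = n
    return hits


# Hypothetical toy inputs: (m/z, intensity) pairs and made-up reference tables.
RAW_SPECTRUM = [(500.0, 1.0), (500.3, 9.0), (500.6, 2.0),
                (742.1, 1.5), (742.4, 12.0), (742.7, 1.0),
                (910.0, 0.5), (910.5, 0.8), (911.0, 0.4)]
PEPTIDE_MASSES = {"PEP_A": 500.28, "PEP_B": 742.45, "PEP_C": 910.50}
PROTEIN_DB = {"PROT_1": {"PEP_A", "PEP_B"}, "PROT_2": {"PEP_C"}}

if __name__ == "__main__":
    peaks = pick_peaks(RAW_SPECTRUM, min_intensity=5.0)
    peptides = match_peptides(peaks, PEPTIDE_MASSES, tol=0.1)
    print(identify_proteins(peptides, PROTEIN_DB))
```

Each stage produces a much smaller artifact than the one before it, which is one reason the raw data (about 10 GB per experiment on this slide) dominates the storage and shipping problem.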

8 [Figure: PEDRo UML class diagram]

9 Data Management Challenges
1. INTEGRATING DATA FROM DIVERSE SOURCES IS THE KEY TO GTL’S SUCCESS
   diverse = different laboratory methods, different organizations, different aspects of cellular functions/pathways
2. CAPTURING METADATA IS VERY IMPORTANT
3. In the future, we must be able to process LARGE numbers of LARGE data sets
Item 3 is important, but not as important as items 1 and 2. We have to address those first.

10 Why is Data Integration So Important to the GTL Program?
Experimental data will be used to build models of cellular pathways, i.e., what goes on inside of the cell.
Different types of data contribute to building different aspects of the model (response to environmental conditions, growth phases, etc.).
Think of building a pathway as an inverse problem.
In addition, experimental data are used to verify models.

11 Why are Metadata So Important to the GTL Program?
We need to capture not only sample treatment (e.g., heat shock, oxygen stress), but all of the conditions under which an experimental analysis was performed. Otherwise we cannot compare the results from different experiments.
We want to investigate how the same organism responds to different conditions, and how different organisms respond to the same condition.
We also want to capture uncertainty.
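
One way to picture the metadata argument on this slide is a record that stores the treatment, the full set of conditions, and an uncertainty with every measured value, plus a check that two experiments differ only in the condition being studied. The sketch below is a minimal illustration under assumed names: the classes, field names, organism label, and numbers are hypothetical and do not come from any GTL schema.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    value: float
    uncertainty: float  # assumed here to be one standard deviation


@dataclass
class ExperimentRecord:
    organism: str
    treatment: str                 # e.g. "heat shock", "oxygen stress"
    conditions: dict               # every condition of the analysis
    results: dict = field(default_factory=dict)


def differing_conditions(a, b):
    """Condition names whose values differ, or that one record failed to capture."""
    keys = set(a.conditions) | set(b.conditions)
    return sorted(k for k in keys if a.conditions.get(k) != b.conditions.get(k))


def comparable(a, b, varied=()):
    """True if the records differ only in the deliberately varied conditions."""
    return set(differing_conditions(a, b)) <= set(varied)


# Hypothetical records for illustration only.
control = ExperimentRecord(
    "organism_X", "none",
    {"temperature_C": 30, "growth_phase": "mid-log", "medium": "minimal"},
    {"gene_1": Measurement(1.0, 0.1)})
heat_shock = ExperimentRecord(
    "organism_X", "heat shock",
    {"temperature_C": 42, "growth_phase": "mid-log", "medium": "minimal"},
    {"gene_1": Measurement(3.2, 0.4)})
incomplete = ExperimentRecord(
    "organism_X", "heat shock",
    {"temperature_C": 42, "medium": "minimal"})  # growth phase never recorded

if __name__ == "__main__":
    print(comparable(control, heat_shock, varied=("temperature_C",)))
    print(differing_conditions(control, incomplete))
```

A condition that was never recorded is treated as a difference, so the incomplete record cannot be compared with the control. That mirrors the slide's point that results cannot be compared unless all conditions are captured.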

12 Other Data Management Needs
All of the usual ones…
  – secure access
  – storage of large volumes of data
  – data archives
  – data provenance
plus one wrinkle…
  – “staging of data access and management”.

13 Staging of Data Access/Management
Stage 1: data are collected and undergo QA/QC within the lab producing the data – manage data locally.
Stage 2: data are shared with other project collaborators – transport data and/or provide restricted access.
Stage 3: data are published and move into the public domain – provide community-wide access to data.
Stage 4: data are archived – need to provide safe storage from which data can be retrieved.
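
The four stages above behave like a small state machine in which the set of permitted readers widens as a data set moves forward. The following sketch shows one possible encoding. The role names are hypothetical, and the choice to keep archived data readable by everyone is an assumption made for the example, since the slide only says that archived data must remain retrievable.

```python
from enum import IntEnum


class Stage(IntEnum):
    LOCAL = 1      # collected and QA/QC'd within the producing lab
    SHARED = 2     # shared with project collaborators
    PUBLISHED = 3  # released to the community
    ARCHIVED = 4   # in safe long-term storage


# Hypothetical role names; the archived-stage policy is an assumption.
READERS = {
    Stage.LOCAL: {"producing_lab"},
    Stage.SHARED: {"producing_lab", "collaborator"},
    Stage.PUBLISHED: {"producing_lab", "collaborator", "public"},
    Stage.ARCHIVED: {"producing_lab", "collaborator", "public"},
}


def can_read(stage, role):
    """Return True if the given role may read data at this stage."""
    return role in READERS[stage]


def advance(stage):
    """Move a data set to the next stage; stages are never skipped."""
    if stage is Stage.ARCHIVED:
        raise ValueError("archived data has no further stage")
    return Stage(stage + 1)


if __name__ == "__main__":
    stage = Stage.LOCAL
    while stage is not Stage.ARCHIVED:
        print(stage.name, sorted(READERS[stage]))
        stage = advance(stage)
```

Keeping the policy in one table makes it easy to see that each stage has different access and management needs, which is the wrinkle the talk identifies.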

14 Survey on Data Standards, Data Sharing, and Data Management
Follow-up to work by the GTL Data Standards Working Group
Link to survey mailed to registrants for GTL Program Workshop
50+ respondents
  – mostly experimental biologists
  – 26 from national labs, 16 from universities, 8 from other organizations
See handout for summary of survey results

15 Survey Results
Most common data ‘format’ (78%): spreadsheet
Most common measurement type (70%): image
Few respondents are using any data standard.
FCS (Flow Cytometry Standard), which is a file format, is the only data standard that received a high rating.
About 20% of the respondents expressed a willingness to participate in developing or implementing data standards for GTL.

16 Recommendations from the Survey
Checklist of required information about experiments, experimental conditions, and data
Data standards, data formats, file formats
Software tools/Web interfaces for
  – data entry, including metadata and experiment details
  – data uploading, query, and access
Data organization to relate information on sample origin to experimental data on the sample
DBMS with software to enter data
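
The first recommendation, a checklist of required information, could be enforced at data-entry time with a very small validation step. The sketch below is only one way this might look. The field names are invented for illustration and are not a checklist proposed in the talk or the survey.

```python
# Hypothetical checklist; a real one would be agreed on by the project.
REQUIRED_FIELDS = (
    "sample_origin",
    "organism",
    "treatment",
    "instrument",
    "protocol",
    "data_file",
)


def missing_fields(record):
    """Return the checklist items that are absent or empty in a submission."""
    return [name for name in REQUIRED_FIELDS if record.get(name) in (None, "")]


# Example submission with two gaps: no instrument and an empty protocol.
submission = {
    "sample_origin": "field site A",
    "organism": "organism_X",
    "treatment": "oxygen stress",
    "protocol": "",
    "data_file": "run_001.raw",
}

if __name__ == "__main__":
    gaps = missing_fields(submission)
    if gaps:
        print("Submission incomplete, missing:", ", ".join(gaps))
```

Including `sample_origin` in the checklist also supports the later recommendation to relate information on sample origin to the experimental data for that sample.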

17 Comments from the Survey
“It will help me a lot if someone will offer a short seminar on data standards.”
Data standards are “of more interest to computer scientists than [to] biological scientists.”
“This is all Greek to me which is exactly why very little to nothing is being developed that is useful to biologists like me.”

18 Difficulties in GTL Data Management
Heterogeneous data.
Metadata.
Uncertainty.
Lack of data standards. (Love/hate relationship.)
Variety of DBMS being used.
Variety of instrument output formats.
Different DM phases with respect to data generation, analyses, and publication.
Human factors: lab notebook -> electronic format (potential loss of information), data rearrangement in spreadsheets.
Data attribution.

19 Overall Recommendations
GTL Program:
  – Establish data standards and facilitate implementation. Data standards MUST be compatible with formats required by journals.
  – Establish project-wide schema for organism/gene-based database(s) to facilitate integration.
  – Address data conversion problem.
DOE:
  – Require description of data management plan as part of proposal. (Currently being done?)
  – Investigate digital notepad technology?

20 Acknowledgements
Carol Giometti, Argonne National Lab
Frank Olken, Lawrence Berkeley National Laboratory
Nancy Slater, GTL Project Manager, Lawrence Berkeley National Laboratory

