High throughput biology data management and data intensive computing drivers George Michaels.

Presentation transcript:

1 High throughput biology data management and data intensive computing drivers George Michaels

2 The Scope of the Problem
- A highly multidimensional world of complicated dynamic events
- Both synchronous and asynchronous processes
- Vast scales of time and space
- A hierarchy of simultaneous levels of activity
- Thousands of types of cells and environments

3 It's All About the Complexity
- The human genome has changed the way biologists approach scientific challenges.
- Biology is an information science.
- Biology applications are scaling at a rate that exceeds the growth in computing capability.
- GTL presents the opportunity to expand throughput by 5- to 50-fold per year.
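Note: the mismatch between data growth and computing capability on this slide can be made concrete with a little arithmetic. The sketch below compounds the slide's 5- to 50-fold yearly throughput increase against an assumed compute doubling time of 18 months. That doubling time is an illustrative assumption, not a figure from the presentation, and the function names are hypothetical.

```python
def data_growth(fold_per_year: float, years: float) -> float:
    """Relative data volume after `years` of compounding yearly fold increases."""
    return fold_per_year ** years


def compute_growth(years: float, doubling_time_years: float = 1.5) -> float:
    """Relative compute capability, ASSUMING it doubles every 18 months."""
    return 2.0 ** (years / doubling_time_years)


def gap(fold_per_year: float, years: float, doubling_time_years: float = 1.5) -> float:
    """How many times faster data has grown than compute over the period."""
    return data_growth(fold_per_year, years) / compute_growth(years, doubling_time_years)


if __name__ == "__main__":
    # The two ends of the range quoted on the slide.
    for fold in (5, 50):
        for years in (1, 3, 5):
            print(f"{fold}x/year over {years} y: data outgrows compute by {gap(fold, years):,.1f}x")
```

Under these assumptions, even the low end of the range leaves compute far behind within a few years, which is the point the slide makes qualitatively.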

4 Billions of Bases in GenBank
[Chart: billions of bases in GenBank by year, 1982 to 2002]
- According to the GOLD database, there are 146 published genomes, 344 ongoing prokaryotic genome projects, and 243 ongoing eukaryotic genome projects.
- DOE never supported a comprehensive and effective data management and curation program for GenBank.
- The Protein Data Bank (PDB) is a repeat of the same scenario.
- Both database efforts were ahead of the science that capitalized on the work.
- Curation and provenance strategies are still unsolved hard problems for these data.

5 Growth of Proteomic Data vs. Sequence Data

6 From BERAC – December 2002

7 Creating an Integrated Computational Biology Environment: Computing Issues for GTL Facilities and Projects
1. LIMS & Workflow Management
2. Data Capture and Archiving
3. Data Analysis / Reduction
4. Interpretation / Modeling / Simulation
5. The Community Data Resource
6. Infrastructure

8 Creating an Integrated Computational Biology Environment: Central Role of GTL Facilities in Compute Planning
- The GTL Facilities will represent the cornerstone of the GTL enterprise and major sites for development of computing systems.
- They will generate massive amounts of data for use by the community and for constructing models of biology.
- The facilities will be the sites where experiment workflow must be facilitated, data must be analyzed, and systems biology data and models provided to the community.
- They are likely to contain integrated high performance computing, and to share suites of tools to analyze data and massive data archives.
- Their combined and integrated output will become the major portion of the GTL community resource (GTL knowledge base).

9 Need New Data Handling and Computing Resources to Handle Data Tsunami
[Figure labels: Current data infrastructure; DATA: sequence, proteomic, metabolic, image, modeling, simulation, and more; "Help!"]

10 Experiment Design Metadata Issues
- Experiment design context provides the most powerful context-dependent annotation for gene/protein activities
- Experiment designs will evolve over time
- Experiment designs should specify what data needs capturing
- Statistical experiment designs should drive discovery activities
- Flexible approaches are needed to adapt to new data collection modes and data types
- Model-driven experimentation needs to include the prediction/hypothesis tested
- Experiments → [samples, genetics, treatments, conditions, time, [quality measures]]
- Samples → [attributes, [measurements, [qc measures]]]
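Note: the two nested templates on this slide (experiments containing samples, samples containing measurements with QC measures) map naturally onto nested record types. The following is a minimal sketch under stated assumptions, not the GTL schema: all class names, field names, and example values are hypothetical. It also illustrates the bullets about specifying what data needs capturing and recording the hypothesis tested.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Measurement:
    name: str                      # hypothetical label, e.g. "protein_abundance"
    value: float
    qc_measures: dict[str, float] = field(default_factory=dict)


@dataclass
class Sample:
    attributes: dict[str, Any]
    measurements: list[Measurement] = field(default_factory=list)


@dataclass
class Experiment:
    samples: list[Sample]
    genetics: str
    treatments: list[str]
    conditions: dict[str, Any]
    time_points: list[float]
    quality_measures: dict[str, float] = field(default_factory=dict)
    hypothesis: str = ""           # prediction tested, for model-driven experiments
    required_measurements: set[str] = field(default_factory=set)


def uncaptured(exp: Experiment) -> dict[int, set[str]]:
    """Map sample index to the required measurement names it is still missing."""
    gaps: dict[int, set[str]] = {}
    for index, sample in enumerate(exp.samples):
        captured = {m.name for m in sample.measurements}
        missing = exp.required_measurements - captured
        if missing:
            gaps[index] = missing
    return gaps


# Illustrative values only; nothing here comes from the presentation.
example = Experiment(
    samples=[
        Sample({"replicate": 1}, [
            Measurement("protein_abundance", 3.2, {"signal_to_noise": 18.0}),
            Measurement("metabolite_level", 0.7),
        ]),
        Sample({"replicate": 2}, [Measurement("protein_abundance", 2.9)]),
    ],
    genetics="wild type",
    treatments=["heat shock"],
    conditions={"temperature_c": 42},
    time_points=[0.0, 0.5, 2.0],
    hypothesis="Heat shock raises abundance of the target protein.",
    required_measurements={"protein_abundance", "metabolite_level"},
)

if __name__ == "__main__":
    print(uncaptured(example))
```

Keeping the required measurements on the experiment design itself means a capture system can report gaps per sample without hard-coding any particular data type, which fits the slide's call for flexibility as designs evolve.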

11 GTL Experiment Template

12 Creating an Integrated Computational Biology Environment: The GTL Informatics Whole Picture
Facility x and Facility y (each):
- LIMS & Workflow Management
- Data Capture and Archiving DBs
- Data Analysis / Reduction
- Modeling and Simulation
- Output to community data resource
"The GTL ORACLE":
- Shared LIMS / Workflow
- Protein Production DB, Protein Expression and Regulation DB, Protein Machines DB, Cell & Community Systems DB
- Large-scale shared bulk data archives: MassSpec Archive, Image Archive, Expression Archive, ...
- Shared Tools Libs: Mass spec analysis tools Lib, Confocal Image analysis tools Lib, Expression Analysis Lib, ...
- Modeling & simulation Tools Lib: Molecular Dynamics Simulation Library, Protein Machine modeling tools, Regulatory network modeling tools, ...

13 What's in the Knowledgebase?
1. Protein Production DB
- microbial baseline annotation, genes, proteins...
- catalog of proteins and reagents produced / inventory
- biophysical and biochemical characterizations of proteins
- protocols and methods
2. Protein Expression & Regulation DB
- protein expression data per condition per microbe
- regulatory networks based on expression data
- metabolite / metabolic network data
- protocols and methods
3. Protein Machines DB
- protein machines catalog
- protein machines models of organization / dynamics
- protein interaction network models and simulations
- protocols and methods
4. Cell and Community Systems DB
- in vivo cell measurements of expression / machines
- measurements of community interactions / metabolism
- integrated cell models (regulation, metabolism, signaling)
- integrated community models
[Diagram labels: Facility 1 Data Resources; Facility 2 Data Resources; Facility 3 Data Resources; Facility 4 Data Resources; Community Data Resource. Contents shown: Protein machines catalog; Protein machines protocols / methods DB; Protein machines models & simulations; Interaction network models database; Regulatory network models database; Metabolic network models database; Cell growth & methods & protocols; Protein expression DB; Microbial genome baseline annotation; Proteins and reagents catalog; Protein biophysical / biochemical data; Protein production protocols / methods; In vivo protein and machine expression / localization; Community metabolism and interactions; Cell models and simulations; Community models and simulations]

14 Community Data Resource: R & D Challenges
- Design and integration of the major databases
- Huge data volumes, great schema complexity - need for new types of databases (hardware and software)
- Database technologies – object-relational, graph DBs, …
- Data standards, representations, ontologies for very complex objects
- User access systems for browsing, query, visualization, and to run analysis or simulations
- Supporting simulation from DBs - how to allow users to utilize models and run simulations; how to link simulations to underlying data
- Integration - provide an integrated view of the biology, with data from other community sources
- Community access to compute power to run long time-scale simulations
- IP issues and reward system
- How to represent incomplete, sparse, conflicting data
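Note: one possible way to approach the last challenge (incomplete, sparse, conflicting data) is to store every claim as a separate observation with its source attached, rather than overwriting a single "current" value. The sketch below is an assumption-laden illustration, not a design from the presentation: the `Observation` record, the `summarize` function, and the example entities and sources are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    entity: str              # e.g. a gene or protein identifier
    prop: str                # the property being asserted
    value: Optional[str]     # None means "looked for, but not determined"
    source: str              # provenance: which facility or database said so


def summarize(observations: list[Observation]) -> dict[tuple[str, str], tuple[str, list[str], list[str]]]:
    """Group observations by (entity, property) and classify each group.

    Returns status ("unknown", "consistent", or "conflicting"), the distinct
    known values, and the contributing sources.
    """
    grouped: dict[tuple[str, str], list[Observation]] = defaultdict(list)
    for ob in observations:
        grouped[(ob.entity, ob.prop)].append(ob)

    report = {}
    for key, group in grouped.items():
        values = sorted({ob.value for ob in group if ob.value is not None})
        sources = sorted({ob.source for ob in group})
        if not values:
            status = "unknown"
        elif len(values) == 1:
            status = "consistent"
        else:
            status = "conflicting"
        report[key] = (status, values, sources)
    return report


# Illustrative records only; identifiers and sources are made up.
records = [
    Observation("geneA", "function", "kinase", "facility_1"),
    Observation("geneA", "function", "phosphatase", "facility_2"),
    Observation("geneB", "function", None, "facility_1"),
    Observation("geneC", "localization", "membrane", "facility_3"),
    Observation("geneC", "localization", "membrane", "facility_4"),
]

if __name__ == "__main__":
    for key, (status, values, sources) in summarize(records).items():
        print(key, status, values, sources)
```

Because nothing is overwritten, conflicts stay visible to users and curators, and sparse data is simply the absence of observations for a given entity and property.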

