High throughput biology data management and data intensive computing drivers George Michaels.



2 The Scope of the Problem
- A highly multidimensional world of complicated dynamic events
- Both synchronous and asynchronous processes
- Vast scales of time and space
- A hierarchy of simultaneous levels of activity
- Thousands of types of cells and environments

3 It’s All About the Complexity
- The human genome has changed the way biologists approach scientific challenges.
- Biology is an information science.
- Biology applications are scaling at a rate that exceeds available computing capability.
- GTL presents the opportunity to expand throughput by 5- to 50-fold per year.
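The claim that biology applications outpace computing capability can be made concrete with a little compound-growth arithmetic. The sketch below is illustrative only: the 5- to 50-fold yearly throughput figures come from the slide, while the 18-month doubling time for computing capability is an assumption introduced here, not a number from the presentation.

```python
def relative_gap(data_fold_per_year: float, years: float,
                 compute_doubling_months: float = 18.0) -> float:
    """Ratio of data growth to compute growth after `years`.

    data_fold_per_year: yearly throughput multiplier (5 to 50 on the slide).
    compute_doubling_months: ASSUMED doubling time for computing capability;
    this value is not taken from the presentation.
    """
    data_growth = data_fold_per_year ** years
    compute_growth = 2.0 ** (12.0 * years / compute_doubling_months)
    return data_growth / compute_growth


# Tabulate the gap for the low and high ends of the slide's range.
for fold in (5, 50):
    gaps = [relative_gap(fold, years) for years in (1, 2, 3)]
    print(f"{fold}-fold per year:", ", ".join(f"{g:.3g}x" for g in gaps))
```

Under these assumptions the gap itself grows geometrically, so even the low end of the range pulls away from compute within a year or two.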

4 Billions of Bases in GenBank
- According to the GOLD database, there are 146 published genomes, 344 ongoing prokaryotic genome projects, and 243 ongoing eukaryotic genome projects.
- DOE never supported a comprehensive and effective data management and curation program for GenBank.
- The Protein Data Bank (PDB) is a repeat of the same scenario.
- Both database efforts were ahead of the science that capitalized on the work.
- Curation and provenance strategies are still unsolved hard problems for these data.

5 Growth of Proteomic Data vs. Sequence Data

6 From BERAC – December 2002

7 Creating an Integrated Computational Biology Environment
Computing Issues for GTL Facilities and Projects
1. LIMS & Workflow Management
2. Data Capture and Archiving
3. Data Analysis / Reduction
4. Interpretation / Modeling / Simulation
5. The Community Data Resource
6. Infrastructure

8 Creating an Integrated Computational Biology Environment
Central Role of GTL Facilities in Compute Planning
- The GTL Facilities will represent the cornerstone of the GTL enterprise and will be major sites for the development of computing systems.
- They will generate massive amounts of data for use by the community and for constructing models of biology.
- The facilities will be the sites where experiment workflow must be facilitated, data must be analyzed, and systems biology data and models provided to the community.
- They are likely to contain integrated high performance computing, shared suites of tools to analyze data, and massive data archives.
- Their combined and integrated output will become the major portion of the GTL community resource (GTL knowledge base).

9 Need New Data Handling and Computing Resources to Handle Data Tsunami
[Figure: the current data infrastructure ("Help!") facing incoming data: sequence, proteomic, metabolic, image, modeling, simulation, and more]

10 Experiment Design Metadata Issues
- Experiment design context provides the most powerful context-dependent annotation for gene/protein activities
- Experiment designs will evolve over time
- Experiment designs should specify what data needs capturing
- Statistical experiment designs should drive discovery activities
- Flexible approaches are needed to adapt to new data collection modes and data types
- Model-driven experimentation needs to include the prediction/hypothesis tested
Experiments → [samples, genetics, treatments, conditions, time, [quality measures]]
Samples → [attributes, [measurements, [qc measures]]]
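The nested experiment and sample structure on this slide could be captured roughly as follows. This is a minimal sketch, not the GTL template itself: the class names, field names, and the `missing_data` helper are hypothetical and chosen only to mirror the bullets (hypothesis tested, data that needs capturing, quality measures at each level).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Measurement:
    name: str
    value: float
    qc_measures: Dict[str, float] = field(default_factory=dict)


@dataclass
class Sample:
    attributes: Dict[str, str]
    measurements: List[Measurement] = field(default_factory=list)


@dataclass
class Experiment:
    hypothesis: str                    # the prediction/hypothesis tested
    genetics: str
    treatments: List[str]
    conditions: Dict[str, str]
    time_points_h: List[float]         # sampling times in hours (assumed unit)
    samples: List[Sample] = field(default_factory=list)
    quality_measures: Dict[str, float] = field(default_factory=dict)
    required_data: List[str] = field(default_factory=list)

    def missing_data(self) -> List[str]:
        """Names in required_data that no sample has measured yet."""
        captured = {m.name for s in self.samples for m in s.measurements}
        return [name for name in self.required_data if name not in captured]


# Hypothetical example values, for illustration only.
experiment = Experiment(
    hypothesis="Treatment A increases abundance of protein X",
    genetics="wild type",
    treatments=["treatment A"],
    conditions={"temperature": "30 C"},
    time_points_h=[0.0, 2.0],
    required_data=["protein abundance", "growth rate"],
)
experiment.samples.append(
    Sample(
        attributes={"replicate": "1"},
        measurements=[Measurement("protein abundance", 1.2, {"signal_to_noise": 30.0})],
    )
)
print(experiment.missing_data())
```

Keeping `required_data` on the design itself reflects the bullet that experiment designs should specify what data needs capturing, so gaps can be checked before an experiment is archived.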

11 GTL Experiment Template

12 Creating an Integrated Computational Biology Environment
The GTL Informatics Whole Picture
[Diagram]
Within each facility (Facility x, Facility y):
- LIMS & Workflow Management
- Data Capture and Archiving
- Data Analysis / Reduction
- Modeling and Simulation
- DBs
- Output to community data resource
Shared components:
- “The GTL ORACLE”
- Shared LIMS / Workflow
- Protein Production DB
- Protein Expression and Regulation DB
- Protein Machines DB
- Cell & Community Systems DB
- Large-scale shared bulk data archives: MassSpec Archive, Image Archive, Expression Archive, ...
- Shared Tools Libs: Mass spec analysis tools Lib, Confocal Image analysis tools Lib, Expression Analysis Lib, ...
- Modeling & simulation Tools Lib: Molecular Dynamics Simulation Library, Protein Machine modeling tools, Regulatory network modeling tools, ...

13 What’s in the Knowledgebase?
Community Data Resource
1. Protein Production DB
- microbial baseline annotation, genes, proteins...
- catalog of proteins and reagents produced / inventory
- biophysical and biochemical characterizations of proteins
- protocols and methods
2. Protein Expression & Regulation DB
- protein expression data per condition per microbe
- regulatory networks based on expression data
- metabolite / metabolic network data
- protocols and methods
3. Protein Machines DB
- protein machines catalog
- protein machines models of organization / dynamics
- protein interaction network models and simulations
- protocols and methods
4. Cell and Community Systems DB
- in vivo cell measurements of expression / machines
- measurements of community interactions / metabolism
- integrated cell models (regulation, metabolism, signaling)
- integrated community models
[Diagram: Facility 1, 2, 3, and 4 Data Resources feeding the Community Data Resource. Labels: Protein machines catalog; Protein machines protocols / methods DB; Protein machines models & simulations; Interaction network models database; Regulatory network models database; Metabolic network models database; Cell growth & methods & protocols; Protein expression DB; Microbial genome baseline annotation; Proteins and reagents catalog; Protein biophysical / biochemical data; Protein production protocols / methods; In vivo protein and machine expression / localization; Community metabolism and interactions; Cell models and simulations; Community models and simulations]
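One way to picture the knowledgebase is as a catalog keyed by database, with a lookup that tells a user which database holds a given kind of content. The sketch below is a toy illustration: the database names and content descriptions are copied from the slide, while the `KNOWLEDGEBASE` dictionary and `find_databases` function are hypothetical names invented here.

```python
from typing import Dict, List

# Contents of the four databases, as listed on the slide.
KNOWLEDGEBASE: Dict[str, List[str]] = {
    "Protein Production DB": [
        "microbial baseline annotation, genes, proteins",
        "catalog of proteins and reagents produced / inventory",
        "biophysical and biochemical characterizations of proteins",
        "protocols and methods",
    ],
    "Protein Expression & Regulation DB": [
        "protein expression data per condition per microbe",
        "regulatory networks based on expression data",
        "metabolite / metabolic network data",
        "protocols and methods",
    ],
    "Protein Machines DB": [
        "protein machines catalog",
        "protein machines models of organization / dynamics",
        "protein interaction network models and simulations",
        "protocols and methods",
    ],
    "Cell and Community Systems DB": [
        "in vivo cell measurements of expression / machines",
        "measurements of community interactions / metabolism",
        "integrated cell models (regulation, metabolism, signaling)",
        "integrated community models",
    ],
}


def find_databases(keyword: str) -> List[str]:
    """Return the databases whose content descriptions mention `keyword`."""
    needle = keyword.lower()
    return [
        db
        for db, contents in KNOWLEDGEBASE.items()
        if any(needle in item.lower() for item in contents)
    ]


print(find_databases("expression"))
```

A real community resource would need far richer metadata than free-text descriptions, but even this flat catalog shows how content is split across the four databases.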

R & D Challenges
Community Data Resource
- Design and integration of the major databases
- Huge data volumes, great schema complexity: need for new types of databases (hardware and software)
- Database technologies: object-relational, graph DBs, …
- Data standards, representations, ontologies for very complex objects
- User access: systems for browsing, query, visualization, and to run analysis or simulations
- Supporting simulation from DBs: how to allow users to utilize models and run simulations; how to link simulations to underlying data
- Integration
  - Provide integrated view of the biology
  - With data from other community sources
- Community access to compute power to run long time-scale simulations
- IP issues and reward system
- How to represent incomplete, sparse, conflicting data
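The last challenge, representing incomplete, sparse, and conflicting data, is left open in the slides. One possible approach, sketched here purely as an assumption, is to store each claim as a separate assertion with its source and evidence rather than overwriting a single value, so that gaps and disagreements stay visible. The `Assertion` class and both helper functions are hypothetical names, and the gene and source identifiers are made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (subject, attribute)


@dataclass(frozen=True)
class Assertion:
    subject: str             # e.g. a gene or protein identifier
    attribute: str           # e.g. "function"
    value: Optional[str]     # None records that the value is unknown
    source: str              # who or what made the claim (provenance)
    evidence: str            # how the claim was obtained


def group_assertions(assertions: List[Assertion]) -> Dict[Key, List[Assertion]]:
    """Collect all assertions made about the same subject and attribute."""
    grouped: Dict[Key, List[Assertion]] = defaultdict(list)
    for a in assertions:
        grouped[(a.subject, a.attribute)].append(a)
    return dict(grouped)


def find_conflicts(assertions: List[Assertion]) -> Dict[Key, List[Assertion]]:
    """Return groups in which sources report different known values."""
    conflicts: Dict[Key, List[Assertion]] = {}
    for key, group in group_assertions(assertions).items():
        known_values = {a.value for a in group if a.value is not None}
        if len(known_values) > 1:
            conflicts[key] = group
    return conflicts


# Hypothetical records, for illustration only.
records = [
    Assertion("geneA", "function", "kinase", "facility_2", "expression assay"),
    Assertion("geneA", "function", "phosphatase", "community_db", "sequence similarity"),
    Assertion("geneB", "function", None, "facility_1", "not yet characterized"),
]
print(sorted(find_conflicts(records)))
```

Keeping the source and evidence on every assertion also touches the curation and provenance problem raised earlier in the talk, since a curator can see who claimed what before resolving a conflict.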