DOE Genomics: GTL Program IT Infrastructure Needs for Systems Biology David G. Thomassen Office of Biological and Environmental Research DOE Office of.


1 DOE Genomics: GTL Program IT Infrastructure Needs for Systems Biology David G. Thomassen Office of Biological and Environmental Research DOE Office of Science March 22, 2004

2

3 GTL Program Goals
Using DNA sequence and high-throughput technologies:
- Goal 1: Identify and characterize the molecular machines of life
- Goal 2: Characterize gene regulatory networks
- Goal 3: Characterize the functional repertoire of complex microbial communities in their natural environments at the molecular level
- Goal 4: Develop the computational capabilities to advance understanding of complex biological systems and predict their behavior
Systems Biology: Gain a comprehensive and predictive understanding of the dynamic, interconnected processes underlying living systems

4 Experimental:
- Complete datasets
- Quantitative measurements
- Comprehensive physical characterization:
  - Protein expression and interactions
  - Spatial distributions
  - Process kinetics
Computational:
- Automated data analysis and validation
- Automated integration of diverse data sets
- Human and computer-accessible databases
- Molecular, pathway, and cell-level simulations
The goals require a new synergy between computing and biology.
Ultimate Goal is to Provide Predictive Models of Microbes
This goal drives data collection and computing strategy.

5 GTL Experiment Template Generating Petascale Data Sets
While this example does not account for data processing and compression, it illustrates how even simple raw data storage will quickly become a bottleneck for biologists.

6 GTL Science will Require High Performance Computing for Both Capacity and Capability Problems
Capacity: e.g., high-throughput protein structure predictions, data analysis, sequence comparison
[Figure: sequences ATCGTAGCAATCGACCGT..., CGGCTATAGCCGTTACCG…, TTATGCTATCCATAATCGA..., GGCTTAATCGCATACGAC...; labels: Thread onto templates, Best match]
Capability: e.g., large-scale biophysical simulations, stochastic regulatory simulations:
- Large size and timescale classical simulations
- Highly accurate quantum mechanical simulations

7 Petascale Capacity Problems in Biology
Microbial and Community Genome Annotation
Now:
- Analyze and annotate 20 microbial genomes (720,000 processor hours)
In 5 years:
- Assemble, analyze and annotate community of 200 microbes and phage (10,000,000 processor hours)
- Compare genome sequences (200 megabases) to previous genomes (4 gigabases) (5,000,000 processor hours)
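To give a feel for the processor-hour figures on this slide, the sketch below converts them into ideal wall-clock time. The processor-hour values come from the slide. The processor counts (1,000 and 10,000) and the perfect-scaling assumption are hypothetical and chosen only for illustration.

```python
# Processor-hour figures are taken from the slide; the processor counts
# below are hypothetical machine sizes used only for illustration.
WORKLOADS = {
    "annotate 20 microbial genomes": 720_000,
    "annotate community of 200 microbes and phage": 10_000_000,
    "compare 200 megabases to 4 gigabases": 5_000_000,
}


def wall_clock_days(processor_hours, processors):
    """Ideal wall-clock time in days, assuming perfect parallel scaling."""
    if processors <= 0:
        raise ValueError("processors must be positive")
    return processor_hours / processors / 24.0


if __name__ == "__main__":
    for name, hours in WORKLOADS.items():
        for processors in (1_000, 10_000):
            days = wall_clock_days(hours, processors)
            print(f"{name}: {days:.1f} days on {processors} processors")
```

Real workloads rarely scale perfectly, so these numbers are best read as lower bounds on turnaround time.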

8 Petascale Capability Problems in Biology
Membrane channel simulation
Now:
- Simulate non-flexible protein ion channel K+ flow using quantum methods (2,200,000 processor hours for 4 second simulation)
In 5 years:
- Simulate flexible protein ion pump for producing ATP from K+ gradient (15,000,000 processor hours for 200 nanosecond simulation)

9 Computing Capabilities for GTL Facilities and Projects
1. LIMS & Workflow Management
2. Data Capture and Archiving
3. Data Analysis / Reduction
4. Modeling and Simulation
5. The Community Data Resource
6. Infrastructure
[Figure labels: Collaborative Projects, Facilities]

10 High-Performance Computing Roadmap for the Genomics: GTL Program
[Figure. Axis labels: Biological Complexity; computing power at 1 TF*, 10 TF, 100 TF, 1000 TF (*Teraflops). Marker: Current U.S. Computing. Applications shown: Comparative Genomics; Genome-scale protein threading; Constrained rigid docking; Constraint-Based Flexible Docking; Protein machine interactions; Molecular machine classical simulation; Cell, pathway, and network simulation; Community metabolic regulatory, signaling simulations; Molecule-based cell simulation]

11 Genomics: GTL – A Vision of Systems Biology Research
In 10-15 years we would like to be able to start with a microbe or microbial community of interest and in a matter of days or weeks:
- Generate an annotated DNA sequence
- Produce proteins and molecular tags for most/all proteins
- Identify the majority of multi-protein complexes
- Generate a working regulatory network model
- Identify the biochemical capabilities
- Design reengineering or control strategies in silico

12 1. LIMS and Workflow Management
Capabilities Needed:
- Map experimental strategies to distributed resources and instrument protocols
- Coordinate experimental process management across cyber collaboratories
- Track the process: sample tracking metadata
- Dynamically optimize experiment workflow
- Process and controls documentation / QA
- Localize problems with data production quality
- Share process data across facilities or projects
Make production-scale collaborative science possible
Track and capture metadata
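The sample-tracking and quality-localization capabilities listed on this slide can be sketched as a small data structure. This is a minimal illustration, not a design from the presentation: the class names, field names, and the `first_failed_step` helper are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProcessStep:
    """One step applied to a sample (all field names are hypothetical)."""
    protocol: str
    instrument: str
    facility: str
    qc_passed: bool
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class SampleRecord:
    """Tracking metadata for a single sample across facilities."""
    sample_id: str
    organism: str
    steps: List[ProcessStep] = field(default_factory=list)

    def record(self, step: ProcessStep) -> None:
        """Append a processing step to the sample history."""
        self.steps.append(step)

    def first_failed_step(self) -> Optional[ProcessStep]:
        """Return the earliest step that failed QC, or None if all passed."""
        for step in self.steps:
            if not step.qc_passed:
                return step
        return None


if __name__ == "__main__":
    sample = SampleRecord(sample_id="S-0001", organism="example microbe")
    sample.record(ProcessStep("cell growth", "fermenter-1", "facility-A", True))
    sample.record(ProcessStep("protein prep", "robot-3", "facility-A", False))
    sample.record(ProcessStep("mass spec", "ms-7", "facility-B", True))
    failed = sample.first_failed_step()
    print(failed.protocol if failed else "all steps passed QC")
```

Keeping the full step history with each sample is what makes it possible to trace a data-quality problem back to a specific protocol, instrument, or facility.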

13 1. LIMS and Workflow Management
R & D Challenges and Technologies
- Approaches to coordinated process design, optimization, protocol mapping for a large distributed enterprise
- Explore LIMS and workflow management systems technology including commercial systems – modify?
- Explore approaches to process documentation and control, QA/QC, and process metadata representation – make data reproducible
- Develop collaborative tools, electronic notebooks, web servers for shared access to laboratory data

14 2. Data Capture and Archiving
Capabilities Needed:
- Capture bulk data and metadata from many different measurements and instruments in shared large-scale data archives
- Represent complex non-standard data types: mass spectrometry, light microscopy, cryo-EM, expression, biophysical & biochemical characterization data…
- Capture and represent data quality, statistical reliability measures, process metadata
- Support deposition, access, transfer and retrieval for archives of multi-petabyte size
[Figure captions: Raw data sets; Swimming in Data]

15 2. Data Capture and Archiving
R & D Challenges and Technologies
- Developing representations and models for data and metadata from many different measurements and assays: confocal images, video, mass spec, 3D cryo-EM, ...
- Developing data exchange and format standards for facilities and the community
- Hardware infrastructure for rapid and flexible access to very large (petabyte) data volumes. Research new data storage technologies.
- Research approaches to design, query and retrieval efficiency in large datasets and with non-standard data types
[Figure caption: Raw data sets]

16 3. Data Analysis and Reduction
Capabilities Needed:
Process data from instruments such as mass spectrometers, microscopes, NMR, etc., to reduce and analyze data, e.g.:
- Automatically identify interacting protein events in FRET confocal microscopy
- Identify peptides, proteins, PTMs of interest in mass spectrometry data
- Quantitate changes in / cluster expression data from arrays or mass spectrometry
- Compare metabolite levels under different cell conditions
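As one concrete instance of comparing metabolite levels under different cell conditions, the sketch below computes log2 fold changes between two conditions. The slide does not prescribe a method, so this is an assumed approach. The function name, the pseudocount, and the metabolite names and values are hypothetical.

```python
import math


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Log2 ratio (treated / control) for metabolites measured in both.

    The pseudocount is an assumption that avoids division by zero for
    metabolites that were not detected in one of the conditions.
    """
    shared = sorted(set(control) & set(treated))
    return {
        name: math.log2(
            (treated[name] + pseudocount) / (control[name] + pseudocount)
        )
        for name in shared
    }


if __name__ == "__main__":
    # Hypothetical abundance values in arbitrary units.
    control = {"metabolite_a": 3.0, "metabolite_b": 7.0, "metabolite_c": 5.0}
    treated = {"metabolite_a": 7.0, "metabolite_b": 3.0}
    for name, change in log2_fold_changes(control, treated).items():
        print(f"{name}: {change:+.2f}")
```

Metabolites measured in only one condition are skipped rather than guessed at, which keeps missing data visible to the analyst.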

17 3. Data Analysis and Reduction
R & D Challenges and Technologies
Many types of data, each with algorithm research and development challenges for analyzing data; basic algorithm research needed! E.g.:
- Automated processing of images and video about protein cell localization to achieve high-throughput analysis
- New mass spectrometry algorithms to identify post-translational modifications, cross-linked peptides, and new proteins (de novo MS), and to automate quantitation
- Analysis of NMR, scattering, AFM, ...
Analysis throughput likely to be an issue; research on Grid analysis approaches and codes for large clusters and MPP environments
Approaches to tools libraries and repositories
Develop and adopt software engineering principles and practices for GTL software development; modular, open source

18 4. Modeling and Simulation
Capabilities Needed:
Build models of biology that capture our knowledge, based on a combination of experimental data types; validate these models and use them to predict, e.g.:
- Build regulatory network topology from observations of protein expression based on conditions
- Build a protein-protein interaction network from protein interaction data of several types
- Build a model for the organization of a protein complex from homology modeling, geometry constraints from mass spec, and cryo-EM images
- Build cell models that combine regulation, metabolism and protein interactions

19 4. Modeling and Simulation
R & D Challenges & Technologies
Synthesis: how to infer or reconstruct systems from data – build “optimal” model
- Metabolic pathways from metabolic data & genome
- Regulatory networks from expression data
- Protein interaction networks from binary interaction data
How to integrate different types of data into models
- Integration of different imaging modalities
- Integration of metabolism, regulation, and protein interactions into cell models
- How to derive best interaction networks from raw binary interaction data, cell interaction images, predicted interactions, and co-expression data...
Capture human modes of integration to automate
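One simple way to combine interaction evidence of several types into a network is to keep only protein pairs supported by more than one independent source. This is a sketch under that assumption, not the method proposed in the presentation. The function names, the `min_support` threshold, and the protein and source labels are hypothetical.

```python
from collections import defaultdict, deque


def build_network(observations, min_support=2):
    """Build an undirected network from (protein_a, protein_b, source) tuples.

    An edge is kept only when at least `min_support` distinct evidence
    sources report the pair (an assumed, deliberately simple rule).
    """
    support = defaultdict(set)
    for a, b, source in observations:
        if a == b:
            continue  # ignore self-interactions in this sketch
        support[frozenset((a, b))].add(source)

    network = defaultdict(set)
    for pair, sources in support.items():
        if len(sources) >= min_support:
            a, b = tuple(pair)
            network[a].add(b)
            network[b].add(a)
    return dict(network)


def connected_components(network):
    """Group proteins into connected components (candidate assemblies)."""
    seen = set()
    components = []
    for start in sorted(network):
        if start in seen:
            continue
        seen.add(start)
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in network[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


if __name__ == "__main__":
    # Hypothetical observations: (protein, protein, evidence source).
    observations = [
        ("P1", "P2", "binary_assay"),
        ("P2", "P1", "co_expression"),
        ("P2", "P3", "binary_assay"),
        ("P2", "P3", "imaging"),
        ("P4", "P5", "predicted"),
    ]
    network = build_network(observations)
    print(connected_components(network))
```

A fixed vote threshold treats every source as equally reliable. Weighting sources by their measured error rates would be a natural refinement.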

20 4. Modeling and Simulation
R & D Challenges (cont’d)
How to mathematically represent biology – pathways, networks, communities
- What’s the right calculus to describe regulation / metabolism / protein interaction networks / signaling that allows quantitative prediction?
  - Differential equations? Stochastic or deterministic?
  - Control theory or ad hoc mathematical networks?
  - Binary or discrete value networks?
  - Chaos theory?
  - “Need for new abstractions”
- In what regimes do they work and where do they fail?
How do we deal with missing data, incomplete knowledge, or errors?
- Are there organizing principles or theory that could make us successful with incomplete knowledge?
How to get to longer compute times for physics-based simulations (millisecs and beyond) - steer and sample
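The "stochastic or deterministic?" question can be made concrete with a toy model. The sketch below simulates a single molecular species produced at a constant rate `k` and degraded at rate `g` per molecule. It does so once as an ordinary differential equation (forward Euler) and once as a stochastic simulation using the Gillespie algorithm. The model, parameter values, and function names are illustrative assumptions and do not come from the presentation.

```python
import random


def deterministic(k, g, x0, t_end, dt=0.01):
    """Forward-Euler integration of dx/dt = k - g * x."""
    x = float(x0)
    for _ in range(int(round(t_end / dt))):
        x += (k - g * x) * dt
    return x


def gillespie(k, g, x0, t_end, rng):
    """Gillespie simulation of the same birth-death process.

    Returns the integer molecule count at time t_end.
    """
    t, x = 0.0, int(x0)
    while True:
        birth, death = k, g * x
        total = birth + death
        if total == 0:
            return x
        t += rng.expovariate(total)  # waiting time to the next event
        if t > t_end:
            return x
        if rng.random() * total < birth:
            x += 1  # production event
        else:
            x -= 1  # degradation event


if __name__ == "__main__":
    k, g = 10.0, 1.0  # hypothetical rates; deterministic steady state is k / g
    rng = random.Random(0)
    runs = [gillespie(k, g, 0, 10.0, rng) for _ in range(5)]
    print("deterministic:", round(deterministic(k, g, 0, 10.0), 3))
    print("stochastic runs:", runs)
```

The deterministic model gives one smooth trajectory toward `k / g`, while individual stochastic runs scatter around that value. That scatter matters most when molecule counts are small, which is one regime where the two descriptions diverge.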

21 5. Community Data Resource
“The GTL Knowledge Base”
Capabilities Needed:
Provide community access to data, models, simulations, and protocols for GTL. Allow users to query and visualize data, use models, run simulations.
- Community resources for multiple types of data: machines, interactions, process models, expression, regulation, genome annotation, metabolism,…
- Access to:
  - data
  - protocols and methods
  - analysis tools and user environments
  - models and simulations
- Access to multiple levels of data: raw data, processed results, dynamic models
- Integrated view of the biology represented
- Guide experimental design strategy for next microbe

22 5. Community Data Resource
R & D Challenges and Technologies
Design and integration of the major databases
- Huge data volumes, great schema complexity - need for new types of databases (hardware and software)
- Database technologies – object-relational, graph DBs, …
- Data standards, representations, ontologies for very complex objects
User access
- Systems for browsing, query, visualization, and to run analysis or simulations
- Supporting simulation from DBs - how to allow users to utilize models and run simulations; how to link simulations to underlying data
Integration
- Provide integrated view of the biology
- With data from other community sources
Community access to compute power to run long time-scale simulations
IP issues and reward system
How to represent incomplete, sparse, conflicting data

23 6. Infrastructure
Objective: Provide hardware and software environments to support analysis, data storage, modeling and simulation activities required in GTL
Examples of Infrastructure:
- Hardware, network and operating system environments for petascale data storage and retrieval
- Grid computing environments to support distributed large-scale data analysis operations
- Massively parallel architectures for systems simulation
- Discrete mathematics libraries

24 http://DOEGenomesToLife.org

