In silico biology: computational path toward holistic understanding of living cells Andrey A. Gorin Computer Science and Mathematics Oak Ridge National.


1 In silico biology: computational path toward holistic understanding of living cells
Andrey A. Gorin
Computer Science and Mathematics, Oak Ridge National Laboratory
agor@ornl.gov

2 Motivation – Predictive Biology
Developments in Biological Sciences:
- Experimental: From Reduction to Systems Science
- Computational: From Validation to Prediction
Development in Technologies:
- High Throughput Experimentation
- High Performance Computing
Uniqueness of Biology:
- First Principles Approach is Impossible/Impractical
- Enormous Multitude of Scales
- Descriptive Models from Diverse Data
- Large Uncertainties in Data
[Figure labels: Genes (Regulatory Region, Coding Region; P, O, GenA, GenB, GenC); Structure; Function; Kinetics; Dynamics; Models. Data scale 0 TB, 1 TB, 10 TB, 100 TB, 1000 TB with Structure Prediction, Network Simulation, Molecular Simulation, Comparative Genomics, Comparative Proteomics]

3 Scientific Drivers
Bio-energy
Development of integrated experimental and computational approaches for:
- Feedstock optimization with the goal of better cellulose deconstruction by special bacterial systems
- Understanding of microbial communities in single batch processes (harsh environments)
- Enzyme or regulatory circuit design to increase desirable output
Bio-remediation
Environmental restoration using microbial processes requires study of:
- Microbial attachment to mineral environments
- Uptake of contaminant ions by microbes and metal reduction at the microbial membrane
- Conversion of toxic chemicals by microbial systems

4 Our Research Directions
- Mass-spectrometry based proteomics: protein identification and quantification in complex biological samples
- Structural models of protein complexes: docking from known components, fundamental principles of molecular recognition, prediction of protein complexes in novel genomes
- Network models: reconstruction from protein interactions and other sources, simple simulation to demonstrate predictive capabilities
[Diagram labels: Proteomics; Protein Interaction Networks; Genomics; Enzymatic Reactions; Modeling and Simulation; Protein Interaction Networks; Regulatory Networks; Data Analysis]

5 Mass spectrometry process
Exploring Protein Dimension of Bio Universe
- 1 out of 10^5-10^11 candidates must be selected as the correct peptide
- Entire proteome is analyzed in a few hours

6 De novo Platform: Probability Profile Method
We made several advances in the understanding of the “mass spec” mathematics. Taken together, they lead to a conceptually novel platform.
[Diagram labels: Peak Assignment; Random Match Model; Memory Indices (m1, m2, m3, m4, …); Chain Confidence; Database. Example entries: [510] VDDLSSLT; [305] FPVW1 >KKRRHA…LKAAKHREVFKR; FPVW2 >KAH…; FPVW3 …]

7 Results: Capabilities in Mass Spectrometry
Advances in the mathematical understanding and dramatic acceleration of fundamental operations lead to fundamentally new capabilities:
- Output gains, especially in highly confident identifications. Our method gives several times more highly confident identifications.
- Capability to detect unexpected phenomena in the samples. We have found novel biological phenomena in the legacy data and uncovered mistakes in data sets regarded as benchmarks.
Examples:
- Deamidations (6 spectra): IHPFAQTQSLVYPFPGPIPN -> IHPFAQTQSLVYPFPGPIPD
- Incorrect peptides (7 spectra): VIPAADLSQQISTAGTEASGTGNMK -> VIPAADLSEQISTAGTEASGTGNMK …
- Disulphide bond (1 spectrum): AAANFFSASCVPCADQSSFPK
De novo methods can improve even manually verified benchmark data sets obtained by the existing technology.
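The mass differences behind the two peptide pairs on this slide can be checked with a short calculation. The following is a minimal sketch, assuming standard monoisotopic residue masses; the helper names `peptide_mass` and `mass_shift` are illustrative and are not taken from the platform described in the talk. It shows that both substitutions listed above (N to D in the deamidation example, Q to E in the corrected peptide) change the peptide mass by about 0.984 Da.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-termini


def peptide_mass(sequence):
    """Monoisotopic mass of an unmodified linear peptide (hypothetical helper)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mass_shift(original, variant):
    """Mass difference (variant minus original) between two peptides."""
    return peptide_mass(variant) - peptide_mass(original)


# Peptide pairs copied from the slide.
n_to_d = mass_shift("IHPFAQTQSLVYPFPGPIPN", "IHPFAQTQSLVYPFPGPIPD")
q_to_e = mass_shift("VIPAADLSQQISTAGTEASGTGNMK", "VIPAADLSEQISTAGTEASGTGNMK")

print(f"N -> D shift: {n_to_d:.5f} Da")
print(f"Q -> E shift: {q_to_e:.5f} Da")
```

Both pairs differ by less than 1 Da out of a total peptide mass above 2000 Da, which gives a sense of how small the differences between such sequences are.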

8 Model-dependent Science Questions
Many fundamental questions in systems biology are hampered by the lack of reliable predictive models.
Network Models Reliant:
- What biochemical processes in a microbe are related to its traits (hydrogen or ethanol producer, ethanol resistant)?
- How do bacteria degrade lignin or cellulose?
- What are the mechanisms behind the conversion of toxic waste to nontoxic substances by bacteria?
- …
Structural Models Reliant:
- What are the biochemical or regulatory functions of the proteins that are shown to be important for hydrogen production?
- What are the hot spots of cellulases that could impair their binding?
- What are the components of hydrogen-producing protein assemblies?
- …

9 Predictive Model Building
Data: Transcriptomics, Interactomics, Quantitative Proteomics, Genomics, Metabolomics
Network/Pathway Models: Metabolic, Regulatory, Signaling, Protein Interaction
Data: Genomics, X-ray, NMR, Neutron Scattering, Imaging
Structural Models: 3-d Structure, Protein Docking, Protein-RNA, Protein-DNA, Protein-Ligand

10 Protein Complexes in Genomics:GTL
GTL is focused on protein interactions that make life work
Pipeline: Express Bait Protein -> Exogenous / Endogenous Pulldown -> Mass Spectrometry Analysis -> Putative Interacting Proteins at High Throughput
Questions:
- Is the interaction real or an artifact?
- What is the structure of the protein complex?
- What is its function?
- What is its dynamic mechanism?
- Can we answer these questions at scale?
[Figure label: X-ray Diffraction]

11 Modeling Protein Structures and Complexes
Combinatorial and optimization techniques are applied in two areas: development of knowledge-based potentials and analysis of ultra-large structural sets.
Applications: Discovery of protein complexes, Protein folding, Ligand binding
Methods: Graph algorithms, Statistical potentials, Geometry and bioinfo libraries, Shared memory indices, Parallel implementations

12 Computational Algorithms in Structure Modeling
Multitude of combinatorial optimization problems with different data access patterns.
Example: Ab Initio Prediction of Protein 3-d Structure
[Workflow diagram labels, in slide order: 100 GB; Finding Common Motifs; Known structures; Knowledge-based Energy Tables; 3 GB – 5 TB; ROSETTA Monte Carlo protein folding; Energy Optimization; Decoy structures 10^3-10^5 (10^4 ~ 50 TB); Finding Maximal Cliques; Search, Optimization, Enumeration; Merging & Scoring; Search; Search, Optimization, Enumeration; Cliques of structures; Native Structures; 10 GB – 500 TB]
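The "Finding Maximal Cliques" step named on this slide can be illustrated with a small enumeration routine. The following is a minimal sketch of the classic Bron-Kerbosch algorithm with pivoting, a standard serial method for maximal clique enumeration; the talk does not say which algorithm its own codes use, so this is an assumed stand-in. The function name `maximal_cliques` and the toy graph are hypothetical.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph.

    adj maps each node to the set of its neighbours (no self-loops).
    Classic Bron-Kerbosch recursion with a pivot to prune branches.
    """
    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            # Nothing can extend the clique, so it is maximal.
            yield frozenset(clique)
            return
        # Pivot: the node with the most neighbours among the candidates.
        pivot = max(candidates | excluded, key=lambda v: len(adj[v] & candidates))
        # Only non-neighbours of the pivot need to be tried at this level.
        for v in list(candidates - adj[pivot]):
            yield from expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    yield from expand(set(), set(adj), set())


# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
cliques = set(maximal_cliques(toy))
for clique in sorted(cliques, key=sorted):
    print(sorted(clique))
```

Each top-level branch of the recursion is independent of the others, which is one reason this kind of enumeration is a natural candidate for distribution across many processors.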

13 Results: Protein Docking
- Full implementation of Bayesian potential energy functions for protein docking
- The size of the docking benchmarking set (~1,200) is unprecedented in the field.
- High-quality results were obtained for over 70% of all tested complexes (native in top 5), indicating that the method is very effective.
- Successfully modeled protein complexes using the structures from other organisms
- Docking calculations are scalable to thousands of processors due to the surface patch approach for parallelization
[Figure captions: 40 decoys with the theoretical probability near 0.8 have on average 60% of native contacts; Up to 1000 threads]

14 Future: Protein Docking
- Further develop the Protein Interface Server (PINS) database (pins.ornl.gov).
- Develop theory and computational implementation for docking potentials taking into account orientation of the contacting residues (Bayesian).
- Petascale implementations for docking platforms.
- Improve mathematical methods for prediction validation.
- Analyze and annotate PINS interfaces related to the metabolic functions involving carbon-processing pathways in cellulose-degrading bacteria.
- Construct predictions for the organisms important for Bioenergy Centers in cases when full or partial genome sequences are available.

15 Predictive Model Building
Data: Transcriptomics, Interactomics, Quantitative Proteomics, Genomics, Metabolomics
Network/Pathway Models: Metabolic, Regulatory, Signaling, Protein Interaction
Data: Genomics, X-ray, NMR, Neutron Scattering, Imaging
Structural Models: 3-d Structure, Protein Docking, Protein-RNA, Protein-DNA, Protein-Ligand

16 LDRD D07-014: Modeling Cellular Mechanisms for Efficient Bioethanol Production through Petascale Analysis of Biological Networks
Andrey Gorin, Nabeela Ahmad, Andrew Bordner, Robert Day, Jessie Gu, Guruprasad Kora, Chongle Pan, Byung-Hoon Park, Nagiza Samatova, Edward Uberbacher, Cray Inc.
Oak Ridge National Laboratory, FY2007-FY2008

17 Future: Graph Algorithms for Networks
Analysis of graphs reflecting physical interactions in structures
- Very wide set of problems, but with a uniform set of “translation” rules
- Enormously large graphs
- Directly connected to modeling petascale applications
Developing graph representations for real cellular subsystems
- Nodes: genes, proteins, DNA elements, metabolites
- Edges: translation, positive regulation, production of metabolite
- Flexible representations
Integration of many tools to annotate 5’ regions
- Promoter alignment
- Models to predict transcription patterns based on promoter models
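The node and edge types listed on this slide map naturally onto a small typed directed graph. The following is a minimal sketch of one possible representation; the class `CellGraph`, its methods, and the example entities (`genA`, `ProtA`, `promB`, `genB`, `ProtB`, `metX`) are all hypothetical and are not taken from the software discussed in the talk.

```python
from collections import defaultdict

# Node and edge categories as listed on the slide.
NODE_KINDS = {"gene", "protein", "dna_element", "metabolite"}
EDGE_KINDS = {"translation", "positive_regulation", "metabolite_production"}


class CellGraph:
    """Directed graph whose nodes and edges each carry a kind label."""

    def __init__(self):
        self.kind_of = {}                    # node name -> node kind
        self.out_edges = defaultdict(list)   # source -> [(target, edge kind)]

    def add_node(self, name, kind):
        if kind not in NODE_KINDS:
            raise ValueError(f"unknown node kind: {kind}")
        self.kind_of[name] = kind

    def add_edge(self, source, target, kind):
        if kind not in EDGE_KINDS:
            raise ValueError(f"unknown edge kind: {kind}")
        if source not in self.kind_of or target not in self.kind_of:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[source].append((target, kind))

    def targets(self, source, kind):
        """Names of nodes reached from source by edges of the given kind."""
        return [t for t, k in self.out_edges.get(source, []) if k == kind]


# Hypothetical toy subsystem: genA encodes ProtA, which activates the
# promoter of genB; genB encodes ProtB, which produces metabolite metX.
g = CellGraph()
for name, kind in [("genA", "gene"), ("ProtA", "protein"),
                   ("promB", "dna_element"), ("genB", "gene"),
                   ("ProtB", "protein"), ("metX", "metabolite")]:
    g.add_node(name, kind)
g.add_edge("genA", "ProtA", "translation")
g.add_edge("ProtA", "promB", "positive_regulation")
g.add_edge("genB", "ProtB", "translation")
g.add_edge("ProtB", "metX", "metabolite_production")

print(g.targets("ProtA", "positive_regulation"))
```

Keeping the kinds in plain sets makes the representation easy to extend, for example with negative regulation, without changing the graph class itself.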

18 Future: Mining Networks for Bioenergy
- Switchgrass paralogs/homologs have to be identified in collections of >250,000 partial gene data (EST) from rice, maize, and sorghum genomes
- It is expected that 6000 Populus lines will be passed through the cell wall phenotyping pipeline (expression arrays, proteomics data, etc.)
- Huge data sets are already available (e.g. Obayashi et al. (2007): 1,388 expression arrays)
(From Bioenergy Center proposal)

19 Future: Taking It to Petascale
Port to NLCF platforms
- Scale Maximal Clique Enumeration from 100s to 1000s of processors
- Introduce similar implementations for other codes in pGraph and BioGraphE
Optimize performance on NLCF architectures:
- Deliver efficient MPI-based multi-core implementations
- Minimize thread locking/unlocking overheads
- Exploit data locality in job redistribution strategies
- Minimize I/O overheads via buffering, data compression, hierarchical parallel I/O

20 Potential Areas of Joint Work
Mathematics
- Statistics of pattern recognition problems
- Combinatorial optimization methods beyond dynamic programming
- Bayesian systems with strong feature correlations
Computer science
- Graph algorithms working for VERY large graphs (10^4-10^5 nodes) and across many graphs (10^3) in parallel
- Flexible software systems for representation of transcription regulation networks in the bacterial cells (4 genes)
- Petascale implementation strategies: MPI implementations, load balancing, etc.
Biological Applications
- Novel applications in mass spectrometry (e.g. gene finding in new genomes, special gene expression in stem cells, mass-spec of cross-linked systems)
- Expression array analysis and metabolic pathway construction for bacteria involved in bioenergy processing

