Presentation is loading. Please wait.

Presentation is loading. Please wait.

New Developments in the Pathway Tools Software and EcoCyc Database Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International

Similar presentations


Presentation on theme: "New Developments in the Pathway Tools Software and EcoCyc Database Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International"— Presentation transcript:

1 New Developments in the Pathway Tools Software and EcoCyc Database Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International pkarp@ai.sri.com BioCyc.org EcoCyc.org MetaCyc.org HumanCyc.org

2 SRI International Bioinformatics SRI International Private nonprofit research institute No permanent funding sources 1200 staff in Menlo Park Multidisciplinary – Founded in 1946 as Stanford Research Institute – Separated from Stanford University in 1970 – Name changed to SRI International in 1977 – David Sarnoff Research Center acquired in 1987

3 SRI International Bioinformatics SRI Organization Information and Computing Sciences Engineering Systems And Sciences Physical Sciences BioSciences Education and Policy

4 SRI International Bioinformatics Overview Motivations and terminology l Refine rationale for MODs Overview of Pathway Tools New Developments in Pathway Tools and EcoCyc

5 SRI International Bioinformatics Model Organism Databases DBs that describe the genome and other information about an organism Every sequenced organism with an active experimental community requires a MOD l Integrate genome data with information about the biochemical and genetic network of the organism l Integrate literature-based information with computational predictions Curated by experts for that organism l No one group can curate all the world’s genomes l Distribute workload across a community of experts to create a community resource

6 SRI International Bioinformatics Rationale for MODs Each “complete” genome is incomplete in several respects: l 40%-60% of genes have no assigned function l Roughly 7% of those assigned functions are incorrect l Many assigned functions are non-specific Need continuous updating of annotations with respect to new experimental data and computational predictions l Gene positions, sequence, gene functions, regulatory sites, pathways MODs are platforms for global analyses of an organism l Interpret omics data in a pathway context l In silico prediction of essential genes l Characterize systems properties of metabolic and genetic networks

7 SRI International Bioinformatics Potential MOD Authors Sequencing center that sequenced genome Experimentalists that work with that organism Computational biologists who want to perform global and/or comparative analyses

8 SRI International Bioinformatics BioCyc Collection of Pathway/Genome Databases Pathway/Genome Database (PGDB) – combines information about l Pathways, reactions, substrates l Enzymes, transporters l Genes, replicons l Transcription factors/sites, promoters, operons Tier 1: Literature-Derived PGDBs l MetaCyc l EcoCyc -- Escherichia coli K-12 l BioCyc Open Chemical Database Tier 2: Computationally-derived DBs, Some Curation -- 18 PGDBs l HumanCyc l Mycobacterium tuberculosis Tier 3: Computationally-derived DBs, No Curation -- 145 DBs

9 SRI International Bioinformatics BioCyc Tier 3 145 PGDBs l 130 prokaryotic PGDBs created by SRI u Source: CMR database l 15 prokaryotic and eukaryotic PGDBs created by EBI u Source: UniProt Automated processing by PathoLogic l Pathway prediction l Operon prediction (bacteria)

10 SRI International Bioinformatics Pathway/Genome Database Chromosomes, Plasmids Genes Proteins Reactions Pathways Compounds CELL Operons, Promoters, DNA Binding Sites

11 SRI International Bioinformatics Pathway Tools Software Pathway/ Genome Databases Pathway/Genome Navigator PathoLogic Pathway Predictor Pathway/ Genome Editors

12 SRI International Bioinformatics Pathway Tools Modes of Use Majority of MOD services provided by Pathway Tools Pathway Tools provides a pathway module as an add-on to existing MOD

13 SRI International Bioinformatics Pathway Tools Software: PathoLogic Computational creation of new Pathway/Genome Databases Transforms genome into Pathway Tools schema and layers inferred information about the genome Predicts operons Predicts metabolic network Predicts pathway hole fillers Bioinformatics 18:S225 2002

14 SRI International Bioinformatics Pathway Tools Software: Pathway/Genome Editors Support interactive updating of PGDBs with graphical editors Support geographically distributed teams of curators with object database system Gene editor Protein editor Reaction editor Compound editor Pathway editor Operon editor Publication editor

15 SRI International Bioinformatics Pathway Tools Software: Pathway/Genome Navigator Querying, visualization of pathways, chromosomes, operons Analysis operations l Pathway visualization of gene- expression data l Global comparisons of metabolic networks l Comparative genomics WWW publishing of PGDBs Desktop operation

16 SRI International Bioinformatics Pathway/Genome DBs Created by External Users 50 groups applying the software to more than 80 organisms Software freely available to academics; Each PGDB owned by its creator Saccharomyces cerevisiae, SGD project, Stanford University l pathway.yeastgenome.org/biocyc / TAIR, Carnegie Institution of Washington Arabidopsis.org:1555 dictyBase, Northwestern University GrameneDB, Cold Spring Harbor Laboratory Planned: l CGD (Candida albicans), Stanford University l MGD (Mouse), Jackson Laboratory l RGD (Rat), Medical College of Wisconsin l WormBase ( C. elegans ), Caltech DOE Genomes to Life contractors: l G. Church, Harvard, Prochlorococcus marinus MED4 l E. Kolker, BIATECH, Shewanella onedensis l J. Keasling, UC Berkeley, Desulfovibrio vulgaris Plasmodium falciparum, Stanford University l plasmocyc.stanford.edu Fiona Brinkman, Simon Fraser Univ, Pseudomonas aeruginosa Methanococcus janaschii, EBI maine.ebi.ac.uk:1555

17 SRI International Bioinformatics Computing with the Metabolic Network Comparative analysis of metabolic networks Visualization of omics data Correlation of metabolism and transport Connectivity analysis of metabolic network Forward propagation of metabolites Verification of known growth media with metabolic network (Future) Infer growth-media requirements

18 SRI International Bioinformatics Pathway Tools Implementation Details Platforms: l Sun, PC/Linux, and PC/Windows platforms Same binary can run as desktop app or Web server Production-quality software l Version control l Two regular releases per year l Extensive quality assurance l Extensive documentation l Auto-patch l Automatic DB-upgrade 300,000 lines of code

19 SRI International Bioinformatics Pathway Tools Architecture Object DBMS GFP API Pathway Genome Navigator WWW Server X-Windows Graphics Object Editor Pathway Editor Reaction Editor Oracle

20 SRI International Bioinformatics Ocelot Knowledge Server Architecture Frame data model l Classes, instances, inheritance l Frames have slots that define their properties, attributes, relationships l A slot has one or more values u Datatypes include numbers, strings, etc. Transaction logging facility Slot units define metadata about slots: l Domain, range, inverse l Collection type, number of values, value constraints

21 SRI International Bioinformatics Ocelot Storage System Architecture Persistent storage via disk files, Oracle DBMS l Concurrent development: Oracle l Single-user development: disk files Oracle storage l Oracle is submerged within Ocelot, invisible to users l Frames transferred from DBMS to Ocelot u On demand u By background prefetcher u Memory cache u Persistent disk cache to speed performance via Internet Transaction logging facility

22 SRI International Bioinformatics The Common Lisp Programming Environment Gatt studied Lisp and Java implementation of 16 programs by 14 programmers (Intelligence 11:21 2000)

23 SRI International Bioinformatics Peter Norvig’s Solution “I wrote my version in Lisp. It took me about 2 hours (compared to a range of 2-8.5 hours for the other Lisp programmers in the study, 3-25 for C/C++ and 4-63 for Java) and I ended up with 45 non-comment non-blank lines (compared with a range of 51-182 for Lisp, and 107-614 for the other languages). (That means that some Java programmer was spending 13 lines and 84 minutes to provide the functionality of each line of my Lisp program.)” http://www.norvig.com/java-lisp.html

24 SRI International Bioinformatics Common Lisp Programming Environment Interpreted and/or compiled execution Fabulous debugging environment High-level language Interactive data exploration Extensive built-in libraries Dynamic redefinition Find out more! l See ALU.org or l http://www.international-lisp-conference.org/http://www.international-lisp-conference.org/

25 PathoLogic Processing of a Genome

26 SRI International Bioinformatics PathoLogic Inference of Metabolic Pathways Pathway/Genome Database Annotated Genomic Sequence Genes/ORFs Gene Products DNA Sequences Reactions Pathways Compounds Multi-organism Pathway Database (MetaCyc) PathoLogic Software Integrates genome and pathway data to identify putative metabolic networks Genomic Map Genes Gene Products Reactions Pathways Compounds

27 SRI International Bioinformatics PathoLogic: Predict Metabolic Pathways Computationally match enzymes in source genome to the MetaCyc reactions that they catalyze l Match enzyme names and EC numbers to MetaCyc l Support user in manually matching additional enzymes Computationally predict which MetaCyc metabolic pathways are present in the organism l Import MetaCyc pathways based on fraction of enzymes present, and presence of enzymes unique to that pathway Generate report of predicted pathways and the supporting evidence; mark predicted pathways with computational evidence code Generate metabolic overview diagram

28 SRI International Bioinformatics HumanCyc Results 2709 enzymes identified in the human genome (9.5%) l 1653 metabolic enzymes l Plus 203 pathway holes -> 6.5% of genome l 622 of metabolic enzymes assigned to a metabolic pathway 135 predicted metabolic pathways l 203 pathway holes present in 99 pathways l 88 candidate hole fillers found, of which 25 appear solid l Average pathway length: 5.4 reaction steps l 428 of 896 reactions have multiple isozymes

29 SRI International Bioinformatics PathoLogic Step 3: Identify Pathway Hole Fillers Definition: Pathway Holes are reactions in metabolic pathways for which no enzyme is identified holes L-aspartate iminoaspartate quinolinate nicotinate nucleotide deamido-NAD NAD 1.4.3.- 2.7.7.18 quinolinate synthetase nadA n.n. pyrophosphorylase nadC NAD+ synthetase, NH3 - dependent CC3619 6.3.5.1

30 SRI International Bioinformatics organism 1 enzyme A organism 2 enzyme A organism 5 enzyme A organism 4 enzyme A organism 3 enzyme A organism 8 enzyme A organism 7 enzyme A organism 6 enzyme A Step 1: collect query isozymes of function A based on EC# Step 2: BLAST against target genome Step 3 & 4: Consolidate hits and evaluate evidence 7 queries have high-scoring hits to sequence Y gene Y gene X gene Z

31 SRI International Bioinformatics Bayes Classifier P(protein has function X| E-value, avg. rank, aln. length, etc.) protein has function X % of query aligned avg. rank in BLAST output bestE-value Number of queries pwy directon adjacent rxns

32 SRI International Bioinformatics Pathway Hole Filler Why should hole filler find things beyond the original genome annotation? Reverse BLAST searches more sensitive Reverse BLAST searches find second domains Integration of multiple evidence types

33 SRI International Bioinformatics Example Pathway holes L-aspartate iminoaspartate quinolinate nicotinate nucleotide deamido-NAD NAD 1.4.3.- 2.7.7.18 quinolinate synthetase nadA (CC2912) n.n. pyrophosphorylase nadC (CC2915) NAD+ synthetase, NH3 - dependent CC3619 6.3.5.1 CC2913, P=0.99 CC3431*, P=0.90 CC3619, P=0.99 CC2913 L-aspartate oxidase (wrong EC# on rxn) CC3431 ORF CC3619 put. NAD(+)-synthetase (multidomain)

34 SRI International Bioinformatics HumanCyc Pathway Holes Fill holes by predicting the probability that a gene has a particular function 135 pathways containing 538 reactions 99 pathways w/ at least 1 missing reaction 203 reactions have missing enzymes HumanCyc holes filled: No candidates found for 115 of the 203 holes 25 of 88 candidates judged to have strong evidence: l 6 ORFs l 9 multifunctional enzymes l 3 enzymes with different functional assignments l 7 enzymes with imprecise functional assignments

35 SRI International Bioinformatics PathoLogic Step 4: Predict Operons Predict adjacent genes A and B in same operon based on: l Intragenic distance l Functional relatedness of A and B Tests for functional relatedness: l A and B in same gene functional class (MultiFun) l A and B in same metabolic pathway l A codes for enzyme in a pathway and B codes for transporter involving a substrate in that pathway l A and B are monomers in same protein complex Correctly predicts 80% of E. coli transcription units Marks predicted operons with computational evidence codes Bioinformatics 20:709-17 2004

36 SRI International Bioinformatics Pathway Tools APIs and Semantic Inference Layer APIs l Generic Frame Protocol (Lisp) u Database query and update operations u Get-class-all-instances, Get-slot-values, Add-slot-value l PerlCyc l JavaCyc Semantic inference layer l Encode commonly used queries that compute indirect DB relationships u Genes-Of-Pathway, Substrates-Of-Pathway u All-Transcription-Factors, Regulon-Of-Protein

37 SRI International Bioinformatics Other Capabilities Evidence code ontology l 34 codes that can be attached to many object types l Pacific Symposium on Biocomputing pp190-201 2004 APIs l JavaCyc, PerlCyc, Lisp Extensive data import/export tools l Export select objects and attributes to column-delimited files Easy to define Web links from PGDB objects Extensive user support services through SRI l Auto-patch l 200 pages of documentation available: User’s Guide, Schema, Curator’s Guide Active community of contributors l JavaCyc, PerlCyc l SBML and BioPAX export tools

38 SRI International Bioinformatics Pathway Tools Recent Developments Two releases per year in Feb and Aug Version 8.0 l Pathway hole filler l Protein features: schema, query, visualization, editing l Navigator main menu redesigned Version 8.5 l Licensing completely online l Cellular Overview and Omics Viewer Improved u Users can create combined displays of gene expression, proteomics, metabolomics, and reaction flux measurements on the Omics Viewer u Drawing speed is improved u Metabolic pathways in the Overview are now grouped by pathway class u Zooming of the diagram is supported (desktop version only) u The periplasm and outer membrane have been added to the diagram, as have those proteins present in the periplasm and outer membrane u The layout of the Cellular Overview can be computed completely automatically by PathoLogic in a new PGDB l Compound stereochemistry supported l Support for JME chemical editor, molfile import/export

39 SRI International Bioinformatics Pathway Tools Recent Developments Version 9.0 l New genome browser l More compact pathway diagrams

40 SRI International Bioinformatics EcoCyc Project – EcoCyc.org E. coli Encyclopedia l Model-Organism Database for E. coli l Computational symbolic theory of E. coli l Electronic review article for E. coli u 10,500 literature citations u 3600 protein comments l Tracks the evolving annotation of the E. coli genome l Resource for microbial genome annotation Collaborative development via Internet l John Ingraham (UC Davis) l Paulsen (TIGR) – Transport, flagella, DNA repair l Collado (UNAM) -- Regulation of gene expression l Keseler, Shearer (SRI) -- Metabolic pathways, cell division, proteases, RNAses l Karp (SRI) -- Bioinformatics Nuc. Acids. Res. 33:D334 2005 ASM News 70:25 2004 Science 293:2040

41 SRI International Bioinformatics EcoCyc Mission Provide a review-level resource on E. coli genomics and biochemical networks l Combine parts list with computable functions of parts l Ongoing literature-based curation effort for all E. coli genes l Curate metabolic pathways l Curate transcriptional regulatory network l Provide a comprehensive, up-to-date collection of data and knowledge High-fidelity knowledge representation provides computable information Finely crafted graphical interface speeds comprehension Provide powerful bioinformatics tools for query, visualization, analysis, and curation of these data

42 SRI International Bioinformatics EcoCyc = E.coli Dataset + Pathway/Genome Navigator Genes: 4,479 Proteins: 4,273 Reactions: 3,600 Metabolic: 822 Transport: 202 Pathways: 182 Compounds: 934 http://EcoCyc.org/ Gene Regulation: Operons: 956 Trans Factors: 133 Promoters: 1015 Citations: 8,900

43 SRI International Bioinformatics EcoCyc Statistics

44 SRI International Bioinformatics Comments in Proteins, Pathways, Operons, etc.

45 SRI International Bioinformatics EcoCyc Statistics The metabolic network Several possible definitions of “metabolic network”: All biochemical reactions l Exclude signaling u Exclude transport –Exclude macromolecule pathways »Reactions for which all substrates are small molecules Preferred definition: Small-Molecule Metabolism l Reactions in pathways of small-molecule metabolism plus reactions for which all substrates are small molecules

46 SRI International Bioinformatics EcoCyc Statistics – Version 9.0 Metabolic network: l Reactions: 925 u 904 have an associated enzyme u 109 are used in more than one metabolic pathway u 139 have isozymes l Enzymes: 871 u 168 are multifunctional u 450 are monomers; 421 are multimers; 81 are heteromultimers l Substrates: 963

47 SRI International Bioinformatics EcoCyc Pathway Length Distributions

48 SRI International Bioinformatics EcoCyc Procedures DB updates performed by 5 staff curators l Information gathered from biomedical literature l Corrections submitted by E. coli researchers Review-level database (knowledge base) Four releases per year Quality assurance of data and software: l Evaluate database consistency constraints l Perform element balancing of reactions l Run other checking programs l Display every DB object

49 SRI International Bioinformatics Scientists Served by EcoCyc Experimentalists l E. coli experimentalists l Experimentalists working with other microbes l Analysis of expression data Computational biologists l Biological research using computational methods l Genome annotation u “As part of a set of tools used to annotate the Rhodococcus sp. RHA1 genome” l Global or systematic studies Bioinformaticists l Training and validation of new bioinformatics algorithms Metabolic engineers l “Design of organisms for the production of organic acids, amino acids, ethanol, hydrogen, and solvents “ Educators

50 SRI International Bioinformatics EcoCyc Accelerates Science Computational biology research using EcoCyc l Microbial genome annotation l Study topological organization of E. coli metabolic network l Study organization of E. coli metabolic enzymes into structural protein families l Study phylogentic extent of metabolic pathways and enzymes in all domains of life Bioinformatics research using EcoCyc as gold standard l Predict operons l Predict promoters l Predict protein functional linkages l Predict protein-protein interactions and protein-fusion events l Predict protein functions and interactions

51 SRI International Bioinformatics MetaCyc: Metabolic Encyclopedia Nonredundant metabolic pathway database Describe a representative sample of every experimentally determined metabolic pathway Literature-based DB with extensive references and commentary Pathways, reactions, enzymes, substrates Jointly developed by SRI and Carnegie Institution Nucleic Acids Research 32:D438-442 2004

52 SRI International Bioinformatics MetaCyc Curation DB updates by 4 staff curators l Information gathered from biomedical literature l Emphasis on microbial and plant pathways l More prevalent pathways given higher priority l Curator’s Guide lists curation conventions Review-level database Four releases per year Quality assurance of data and software: l Evaluate database consistency constraints l Perform element balancing of reactions l Run other checking programs l Display every DB object

53 SRI International Bioinformatics MetaCyc Data

54 BioWarehouse: The Bio-SPICE Bioinformatics Database Warehouse Peter D. Karp, Tom J. Lee, Valerie Wagner, Yannick Pouliot Oracle or MySQL UniProt ENZYME Genbank Taxonomy BioCyc CMR KEGG BioWarehouse = Java-based Loader = C-based Loader [Oracle or MySQL] UniProt ENZYME Genbank Taxonomy BioCyc CMR KEGG BioWarehouse

55 SRI International Bioinformatics Technical Approach Multi-platform support: Oracle (10G) and MySQL (3.23.58 ) Schema support for multitude of bioinformatics datatypes Create loaders for public bioinformatics DBs l Parse file format of the source DB l Semantic transformations l Insert DB contents into warehouse tables Provide Warehouse query access mechanisms l SQL queries via ODBC, JDBC, OAA

56 SRI International Bioinformatics BioWarehouse Loaders LoaderLanguageData Set genbank-loaderJAVAAll bacterial sequences in the GenBank DB uniprot-loaderJAVASwiss-Prot and TrEMBL protein DBs (XML) biocyc-loaderC BioCyc open PGDBs (e.g., B. anthracis, M. tuberculosis, V. cholerae) cmr-loaderC TIGR's Comprehensive Microbial Resource (CMR) DB of bacterial data enumerations-loaderJAVABioWarehouse’s controlled nomenclature ncbi-taxonomy-loaderCNCBI's Taxonomy DB enzyme-loaderJAVAENZYME DB of enzymatic reactions KEGG-loaderCKEGG DB of pathways Miami-expressPERLLoads microarray gene expression data in MIAMI format

57 SRI International Bioinformatics Summary Pathway/Genome Databases l MetaCyc non-redundant DB of literature-derived pathways l 165 organism-specific PGDBs available through SRI at BioCyc.org l Computational theories of biochemical machinery Pathway Tools software l Extract pathways from genomes l Morph annotated genome into structured ontology l Distributed curation tools for MODs l Query, visualization, WWW publishing

58 SRI International Bioinformatics BioCyc and Pathway Tools Availability WWW BioCyc freely available to all l BioCyc.org Most BioCyc DBs openly available l Flatfiles downloadable from BioCyc.org Pathway Tools freely available to non-profits l PC/Windows, PC/Linux, SUN

59 SRI International Bioinformatics Acknowledgements SRI l Suzanne Paley, Michelle Green, Ron Caspi, Ingrid Keseler, John Pick, Carol Fulcher, Markus Krummenacker, Alex Shearer EcoCyc Project Collaborators l Julio Collado-Vides, John Ingraham, Ian Paulsen MetaCyc Project Collaborators l Sue Rhee, Peifen Zhang, Hartmut Foerster And l Harley McAdams Funding sources: l NIH National Center for Research Resources l NIH National Institute of General Medical Sciences l NIH National Human Genome Research Institute l Department of Energy Microbial Cell Project l DARPA BioSpice, UPC BioCyc.org


Download ppt "New Developments in the Pathway Tools Software and EcoCyc Database Peter D. Karp, Ph.D. Bioinformatics Research Group SRI International"

Similar presentations


Ads by Google