Overview of the Pathway Tools Software and Pathway/Genome Databases Peter D. Karp Bioinformatics Research Group SRI International


1 Overview of the Pathway Tools Software and Pathway/Genome Databases Peter D. Karp Bioinformatics Research Group SRI International pkarp@ai.sri.com

2 Pathway/Genome Database: Integrating Genomic and Biochemical Data
[Diagram labels: CELL; Chromosomes, Plasmids; Genes; Proteins; Reactions; Pathways; Compounds; Operons, Promoters, DNA Binding Sites]

3 Key Functionality
- Pathway analysis
  - Prediction of pathways from genomes
  - Comparative pathway analysis
- Ongoing curation of PGDBs
- WWW publishing of PGDBs
- Analysis of gene expression data

4 Tools and Datasets
[Diagram labels: PGDB (Pathways, Genes); PathoLogic: Create PGDBs; Pathway/Genome Navigator: Visualize, Query and Analyze PGDBs; Editors: Update PGDBs]

5 PathoLogic Pathway Predictor New PGDB Set of Annotated Genes Pathway Prediction MetaCyc PGDB Reports

6 Prediction of Pathways from Genomes
[Diagram labels: DNA Sequence; List of Genes/ORFs; List of Gene Products; Annotated Genome; PathoLogic; Pathway/Genome Database; Genomic Map; Genes; Proteins; Reactions; Compounds; Pathways; Metabolic Network]

7 MetaCyc Overview
- Meta Metabolic Encyclopedia
- 439 pathways, 1095 enzymes, 4217 reactions
  - 173 E. coli pathways
- Literature-based DB with extensive references and commentary
- Pathways, reactions, enzymes, substrates
- Editor in chief: Dr. Monica Riley

8 Pathway/Genome Navigator
- Query and visualization tools for PGDBs
  - Metabolic pathways, reactions, compounds
  - Enzymes, transporters, transcription factors
  - Genome maps, genes, operons, promoters, DNA sites
  - Retrieve nucleotide and DNA sequences
  - Perform BLAST searches
- Runs as an application on Solaris, Windows
- Runs as a WWW server on Solaris
- Query and comparative analysis functions

9 Interactive Editing Tools
- Pathway editor
- Reaction editor
- Gene editor
- Enzyme editor
- Compound editor
- Transcription Unit Editor
- Facilitate updates to PGDBs
  - Improved computational predictions
  - Literature-based data
- Record citations, comments, evidence, history

10 Pathway Views of Expression Data
- Import gene expression data
- Compute expression ratios
- Obtain pathway-based visualizations of data
  - Numerical spectrum of expression values mapped to a color spectrum
  - Steps of overview painted with color corresponding to expression level(s) of genes that encode enzyme(s) for that step
  - Absolute or relative expression values
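The mapping from expression values to colors can be sketched as below. This is only an illustration, not the Pathway Tools implementation: the function names, the log2 ratio, the blue-to-red spectrum, and the clamping range of -3 to 3 are all assumptions.

```python
import math


def log2_ratio(experiment, control):
    """Relative expression: log2 of experiment over control (both must be > 0)."""
    return math.log2(experiment / control)


def value_to_rgb(value, low=-3.0, high=3.0):
    """Linearly map a value in [low, high] onto a blue-to-red spectrum.

    Values outside the range are clamped to the nearest end.
    The range and the two end colors are assumptions for illustration.
    """
    clamped = max(low, min(high, value))
    t = (clamped - low) / (high - low)  # 0.0 at low, 1.0 at high
    return (round(255 * t), 0, round(255 * (1 - t)))
```

An absolute expression value can be passed to `value_to_rgb` directly with a suitable `low`/`high`; a relative value goes through `log2_ratio` first.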

11 Environment for Computational Exploration of Genomes
- Powerful ontology opens many facets of the biology to computational exploration
- Global characterization of metabolic network
- Analysis of interface between transport and metabolism
- Nutrient analysis of metabolic network

12 PathoLogic Pathway Predictor

13 PathoLogic Pathway Predictor
- Introduction
- Description of PPP execution
- Inputs to PPP
- Using the GUI to create a pathway/genome database
- Output from PPP
- Caveats

14 PathoLogic Goals
- Create the set of class frames that encode DB schema
  - Copied from MetaCyc
- Create the appropriate set of instance frames
  - Genes, genetic elements, proteins created from input files
  - Substrates, reactions, and pathways are copied from the reference database
- Interconnect frames in a manner that accurately reflects their semantic relationships

15 PathoLogic Input/Output
- Inputs:
  - File listing genetic elements
    - http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
  - Files containing DNA sequence for each genetic element
  - Files containing annotation for each genetic element
  - MetaCyc database
- Output:
  - Pathway/genome database for the subject organism
  - Directory tree for the subject organism
  - Reports that summarize:
    - Evidence contained in the input genome for the presence of reference pathways
    - Reactions missing from inferred pathways

16 Inputs to PathoLogic Pathway Predictor
- genetic-elements.dat
- Sequence files
- GenBank file format
- PathoLogic format
- Directory Structure

17 genetic-elements.dat
ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//

18 File Naming Conventions
- One pair of sequence and annotation files for each genetic element
- Sequence files: FASTA format
  - Suffix .fsa or .fna
- Annotation file:
  - GenBank format: suffix .gbk
  - PathoLogic format: suffix .pf
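A quick way to sanity-check a genetic-elements.dat file against these naming conventions is sketched below. The helper names are hypothetical and the parser only assumes what the slides show: one attribute per line, records closed by `//`, and the file suffixes listed above. It is not part of Pathway Tools.

```python
from pathlib import PurePosixPath

SEQ_SUFFIXES = {".fsa", ".fna"}
ANNOT_FORMATS = {".gbk": "GenBank", ".pf": "PathoLogic"}

SAMPLE = """\
ID TEST-CHROM-1
NAME Chromosome 1
TYPE :CHRSM
CIRCULAR? N
ANNOT-FILE chrom1.pf
SEQ-FILE chrom1.fsa
//
ID TEST-CHROM-2
NAME Chromosome 2
CIRCULAR? N
ANNOT-FILE /mydata/chrom2.gbk
SEQ-FILE /mydata/chrom2.fna
//
"""


def parse_genetic_elements(text):
    """Return one dict per record; records are terminated by a '//' line."""
    records, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append(current)
            current = {}
            continue
        parts = line.split(None, 1)  # attribute name, then the rest of the line
        current[parts[0]] = parts[1] if len(parts) > 1 else ""
    if current:
        records.append(current)
    return records


def annotation_format(record):
    """Name the annotation format implied by the ANNOT-FILE suffix, or None."""
    return ANNOT_FORMATS.get(PurePosixPath(record.get("ANNOT-FILE", "")).suffix)


def check_element(record):
    """List naming-convention problems for one genetic element."""
    problems = []
    seq = record.get("SEQ-FILE")
    if seq is None:
        problems.append("missing SEQ-FILE")
    elif PurePosixPath(seq).suffix not in SEQ_SUFFIXES:
        problems.append("sequence file suffix should be .fsa or .fna")
    if "ANNOT-FILE" not in record:
        problems.append("missing ANNOT-FILE")
    elif annotation_format(record) is None:
        problems.append("annotation file suffix should be .gbk or .pf")
    return problems
```

For example, `[check_element(r) for r in parse_genetic_elements(SAMPLE)]` yields an empty problem list for both sample chromosomes.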

19 GenBank File Format
- Accepted feature types:
  - CDS, tRNA, rRNA, misc_RNA
- Accepted qualifiers:
  - /label: Unique ID [recm]
  - /gene: Gene name [req]
  - /product [req]
  - /EC_number [recm]
  - /product_comment [opt]
  - /gene_comment [opt]
  - /alt_name: Synonyms [opt]
- For multifunctional proteins, put each function in a separate /product line

20 Typical Problems Using GenBank Files With PathoLogic
- Wrong qualifier names used
- Extraneous information in a given qualifier
- Check results of trial parse carefully
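Before running a trial parse, the qualifier names on each feature can be checked against the accepted list from the GenBank File Format slide. The sketch below works on plain dictionaries rather than a real GenBank parser; the function name is hypothetical, and it assumes that [req], [recm], and [opt] stand for required, recommended, and optional.

```python
ACCEPTED_FEATURES = {"CDS", "tRNA", "rRNA", "misc_RNA"}

# Qualifier name -> status, as listed on the GenBank File Format slide.
QUALIFIER_STATUS = {
    "label": "recommended",
    "gene": "required",
    "product": "required",
    "EC_number": "recommended",
    "product_comment": "optional",
    "gene_comment": "optional",
    "alt_name": "optional",
}


def check_feature(feature_type, qualifiers):
    """Return human-readable warnings for one feature.

    `qualifiers` maps qualifier names (without the leading '/') to values.
    """
    if feature_type not in ACCEPTED_FEATURES:
        return [f"feature type {feature_type!r} is not in the accepted list"]
    messages = []
    for name in qualifiers:
        if name not in QUALIFIER_STATUS:
            messages.append(f"qualifier /{name} is not in the accepted list")
    for name, status in QUALIFIER_STATUS.items():
        if status != "optional" and name not in qualifiers:
            messages.append(f"{status} qualifier /{name} is missing")
    return messages
```

A feature that uses `/function` instead of `/product`, for instance, is reported both for the unlisted name and for the missing required qualifier.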

21 PathoLogic File Format
- Each record starts with a line containing an ID attribute
- Tab delimited
- Each record ends with a line containing //
- One attribute-value pair is allowed per line
  - Use multiple FUNCTION lines for multifunctional proteins
- Lines starting with ‘;’ are comment lines
- Valid attributes are:
  - ID, NAME, SYNONYM
  - STARTBASE, ENDBASE, GENE-COMMENT
  - FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  - DBLINK

22 PathoLogic File Format
ID	TP0734
NAME	deoD
STARTBASE	799084
ENDBASE	799785
FUNCTION	purine nucleoside phosphorylase
DBLINK	PID:g3323039
PRODUCT-TYPE	P
GENE-COMMENT	similar to GP:1638807 percent identity: 57.51; identified by sequence similarity; putative
//
ID	TP0735
NAME	gltA
STARTBASE	799867
ENDBASE	801423
FUNCTION	glutamate synthase
DBLINK	PID:g3323040
PRODUCT-TYPE	P
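The format rules from the previous slide are simple enough to check with a short reader. The following is a minimal sketch, not the PathoLogic parser: `parse_pf` is a hypothetical name, every attribute is kept as a list so that repeated FUNCTION lines survive, and the second gene in the sample is invented to show a multifunctional protein.

```python
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM",
    "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT",
    "DBLINK",
}

SAMPLE = (
    "; first gene is taken from the slide, second gene is made up\n"
    "ID\tTP0734\n"
    "NAME\tdeoD\n"
    "STARTBASE\t799084\n"
    "ENDBASE\t799785\n"
    "FUNCTION\tpurine nucleoside phosphorylase\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
    "ID\tGENE-X\n"
    "NAME\tgeneX\n"
    "FUNCTION\tfirst function\n"
    "FUNCTION\tsecond function\n"
    "//\n"
)


def parse_pf(text):
    """Parse PathoLogic-format text into a list of {attribute: [values]} dicts."""
    records, current = [], None
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip() or line.startswith(";"):
            continue  # blank line or comment line
        if line.strip() == "//":
            if current is not None:
                records.append(current)
            current = None
            continue
        name, tab, value = line.partition("\t")
        if not tab:
            raise ValueError(f"line {lineno}: expected a tab after the attribute name")
        if name not in VALID_ATTRIBUTES:
            raise ValueError(f"line {lineno}: unknown attribute {name!r}")
        if current is None:
            if name != "ID":
                raise ValueError(f"line {lineno}: a record must start with ID")
            current = {}
        current.setdefault(name, []).append(value.strip())
    if current is not None:
        records.append(current)  # tolerate a missing final '//'
    return records
```

Raising on an unknown attribute is a deliberately strict choice for this sketch; a more forgiving reader could collect warnings instead.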

23 Using the PPP GUI to Create a Pathway/Genome Database
- Input Project Information
  - Organism -> Create New
- Trial Parse
  - Build -> Trial Parse
- Build pathway/genome database
  - Build -> Automated Build
- Manual polishing
  - Refine -> Resolve Ambiguous Name Matches
  - Refine -> Assign Modified Proteins
  - Refine -> Create Protein Complexes
  - Refine -> Run Consistency Checker
  - Refine -> Update Overview

24 PathoLogic Command Menus
- Organism
  - Select
  - Create New
  - Save KB
  - Revert KB
  - Reinitialize KB
  - Exit
- Build
  - Trial Parse
  - Automated Build
- Refine
  - Resolve Ambiguous Name Matches
  - Assign Modified Proteins
  - Create Protein Complexes
  - Re-run Name Matcher
  - Rescore Pathways
  - Run Consistency Checker
  - Update Overview

25 Input Project Information

26 PathoLogic PP Parse Output

27 Enzyme Name to Reaction Mapping

28 Enzyme Name Matching Tool
- Dictionary of enzyme names assembled from:
  - All metabolic reactions found in MetaCyc
  - Two files that map synonyms not found in MetaCyc to reaction names:
    - System file (pangea-enzyme-mappings.dat)
    - User-supplied file (local-enzyme-mappings.dat)
- Location of sources:
  - $GPROOT/pathologic/$VERSION-NUMBER/data

29 Enzyme Name Matcher
- Matches on full enzyme name
- Match is case-insensitive and removes the punctuation characters “ -_(){}',:”
- Also matches after removal of prefixes and suffixes such as:
  - “Putative”, “Hypothetical”, etc.
  - alpha|beta|…|catalytic|inducible chain|subunit|component
  - Parenthetical gene name
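The matching rules above can be approximated in a few lines. This sketch is an assumption about how such a matcher might work, not the PathoLogic code: the function names and the reaction ID `RXN-A` are made up, and only the prefixes and qualifiers actually named on the slide are handled (the elided ones are not guessed).

```python
import re

PUNCTUATION = " -_(){}',:"
PREFIXES = ("putative", "hypothetical")
SUFFIX_RE = re.compile(
    r"\s+(alpha|beta|catalytic|inducible)\s+(chain|subunit|component)$",
    re.IGNORECASE,
)
GENE_NAME_RE = re.compile(r"\s*\([^()]*\)\s*$")  # trailing "(geneName)"


def canonical(name):
    """Lower-case the name and delete the punctuation characters."""
    return name.lower().translate(str.maketrans("", "", PUNCTUATION))


def candidate_keys(name):
    """Keys to try, in order: the full name, then the name with extras removed."""
    keys = [canonical(name)]
    reduced = GENE_NAME_RE.sub("", name.strip())
    for prefix in PREFIXES:
        if reduced.lower().startswith(prefix + " "):
            reduced = reduced[len(prefix) + 1:]
            break
    reduced = SUFFIX_RE.sub("", reduced)
    key = canonical(reduced)
    if key not in keys:
        keys.append(key)
    return keys


def match(name, dictionary):
    """Look the name up in a dictionary keyed by canonical enzyme names."""
    for key in candidate_keys(name):
        if key in dictionary:
            return dictionary[key]
    return None


# Hypothetical dictionary entry; the reaction ID is a placeholder.
ENZYME_DICTIONARY = {canonical("purine nucleoside phosphorylase"): "RXN-A"}
```

With this dictionary, `match("Putative purine-nucleoside phosphorylase (deoD)", ENZYME_DICTIONARY)` finds the entry only on the second, reduced key.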

30 Enzyme Name Matcher
- For names that do not match, software identifies probable metabolic enzymes as those
  - Containing “ase”
  - Not containing keywords such as
    - “sensor kinase”
    - “topoisomerase”
    - “protein kinase”
    - “peptidase”
    - Etc.
- Research unknown enzymes
  - MetaCyc, Swiss-Prot, PIR, Medline, EMP
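The "probable metabolic enzyme" filter amounts to a substring test plus an exclusion list. A minimal sketch, with a hypothetical function name and only the four keywords the slide spells out (the real list is longer):

```python
EXCLUDED_KEYWORDS = ("sensor kinase", "topoisomerase", "protein kinase", "peptidase")


def is_probable_metabolic_enzyme(name):
    """True if the name contains 'ase' and none of the excluded keywords."""
    lowered = name.lower()
    return "ase" in lowered and not any(k in lowered for k in EXCLUDED_KEYWORDS)
```

The test is deliberately crude: any name containing the letters "ase" passes, so the flagged names still need the manual research step listed on the slide.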

31 Assigning Evidence Scores to Predicted Pathways
- X|Y|Z denotes the score for pathway P in organism O, where:
  - X = total number of reactions in P
  - Y = number of reactions in P for which O has evidence of a catalyzing enzyme
  - Z = number of the Y reactions that are also used in other pathways in O
- Not clear how to convert these scores into a probability of occurrence
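The X|Y|Z score is easy to compute once each pathway is reduced to a set of reaction IDs. The sketch below is illustrative only: the function names, pathway names, and reaction IDs are made up, and "other pathways in O" is assumed to mean the other candidate pathways passed in.

```python
def evidence_score(pathway, pathway_reactions, reactions_with_evidence):
    """Return (X, Y, Z) for one pathway.

    pathway_reactions: dict mapping pathway name -> set of reaction IDs
    reactions_with_evidence: set of reaction IDs with an enzyme in the organism
    """
    reactions = pathway_reactions[pathway]
    present = reactions & reactions_with_evidence
    used_elsewhere = set()
    for other, other_reactions in pathway_reactions.items():
        if other != pathway:
            used_elsewhere |= present & other_reactions
    return len(reactions), len(present), len(used_elsewhere)


def format_score(score):
    """Render a score tuple in the X|Y|Z notation."""
    return "|".join(str(n) for n in score)


# Made-up example: two pathways that share reaction r3.
PATHWAYS = {"pathway-A": {"r1", "r2", "r3"}, "pathway-B": {"r3", "r4"}}
EVIDENCE = {"r1", "r3"}
```

Here `format_score(evidence_score("pathway-A", PATHWAYS, EVIDENCE))` gives `3|2|1`: three reactions, two with evidence, one of which (r3) also occurs in pathway-B.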

32 Algorithm for Automated Pathway Pruning
- A pathway will never be pruned if it contains a unique enzyme (an enzyme not present in any other pathway)
- A pathway will be pruned if one of the following conditions holds:
  - Evidence is better for a different pathway in same variant set
  - Evidence for only one reaction in pathway, or
  - Its set of reactions present is a proper subset of the reactions present in some other pathway, and
    - If pathway is a biosynthetic pathway, final reaction(s) missing
    - If pathway is a degradation pathway, initial reaction(s) missing
    - If pathway is an energy metabolism pathway, more than half the reactions are missing
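One possible reading of these pruning rules is sketched below. Several details are assumptions rather than the PathoLogic implementation: the `Pathway` fields, comparing evidence by the fraction of reactions present, and treating only the first or last entry of an ordered reaction list as the "initial" or "final" reaction.

```python
from dataclasses import dataclass


@dataclass
class Pathway:
    name: str
    kind: str          # "biosynthesis", "degradation", or "energy"
    reactions: list    # reaction IDs ordered from first step to last
    present: set       # reactions with evidence in the organism
    enzymes: set       # enzymes found for this pathway's reactions
    variant_set: str = ""  # empty string means no variant set


def fraction_present(p):
    return len(p.present) / len(p.reactions)


def should_prune(p, all_pathways):
    others = [q for q in all_pathways if q is not p]

    # Protection rule: a unique enzyme keeps the pathway.
    other_enzymes = set().union(*(q.enzymes for q in others))
    if p.enzymes - other_enzymes:
        return False

    # Condition 1: a variant in the same set has better evidence (assumed metric).
    if p.variant_set and any(
        q.variant_set == p.variant_set and fraction_present(q) > fraction_present(p)
        for q in others
    ):
        return True

    # Condition 2: evidence for only one reaction.
    if len(p.present) == 1:
        return True

    # Condition 3: present reactions are a proper subset of another pathway's,
    # combined with a test that depends on the kind of pathway.
    if any(p.present < q.present for q in others):
        if p.kind == "biosynthesis":
            return p.reactions[-1] not in p.present
        if p.kind == "degradation":
            return p.reactions[0] not in p.present
        if p.kind == "energy":
            missing = len(p.reactions) - len(p.present)
            return missing > len(p.reactions) / 2
    return False
```

The protection rule is checked first so that no later condition can override it, which matches the "never be pruned" wording on the slide.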

33 Creating Protein Complexes

34 Complex Subunits Stoichiometries

35 Proteins as Reaction Substrates

36 Manual Pruning of Pathways
- Use pathway evidence report
  - Coloring scheme aids in assessing pathway evidence
- Phase I: Prune extra variant pathways
- Rescore pathways, re-generate pathway evidence report
- Phase II: Prune pathways unlikely to be present
  - No/few unique enzymes
  - Most pathway steps present because they are used in another pathway
  - Pathway very unlikely to be present in this organism

37 Overview Graph

38 Output from PPP
- Pathway/genome database
- Summary pages
  - Pathway evidence page
    - Click “Summary of Organisms”, then click organism name, then click “Pathway Evidence”, then click “Save Pathway Report”
  - Missing enzymes report
- Directory tree containing sequence files, reports, etc.

39 Resulting Directory Structure
ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/
- input
  - organism.dat
  - organism-init.dat
  - genetic-elements.dat
  - annotation files
  - sequence files
- reports
  - name-matching-report.txt
  - trial-parse-report.txt
- kb
  - ORGIDbase.ocelot
- data
  - overview.graph
- released -> VERSION

40 Caveats
- Cannot predict pathways not present in MetaCyc
- Evidence for short pathways is hard to interpret
- Because many reactions occur in several pathways, there are many false positives

41 The Pathway Tools Schema

42 Motivations for Understanding Schema
- Pathway Tools visualizations and analyses depend upon the software being able to find precise information in precise places within a Pathway/Genome DB
- When writing complex Lisp queries against PGDBs, those queries must name classes and slots within the schema
- A Pathway/Genome Database is a web of interconnected objects; each object represents a biological entity

43 Reference
- Pathway Tools User’s Guide, Volume I
  - Appendix A: Guide to the Pathway Tools Schema

44 Web of Relationships for One Enzyme
[Diagram: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; complex Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; pathway TCA Cycle]

45 Frame Data Model and Schema
- Frame Data Model -- organizational principle for a DB
- Object Displays
- Schema
  - Gene slots
  - Polypeptide slots
  - Protein slots
  - Protein Complex slots
  - Reaction slots
  - Enzymatic Reaction slots

46 Frame Data Model
- Knowledge base (KB, Database, DB)
- Frames
- Slots
- Facets
- Annotations

47 Knowledge Base
- Collection of frames and their associated slots, values, facets, and annotations
- Can be stored within
  - An Oracle DB
  - A disk file
  - A Pathway Tools binary program

48 Frames
- Entities with which facts are associated
- Kinds of frames:
  - Classes: Genes, Pathways, Biosynthetic Pathways
  - Instances (objects): trpA, TCA cycle
- Classes:
  - Superclass(es)
  - Subclass(es)
  - Instance(s)
- A symbolic frame name (id, key) uniquely identifies each frame

49 Slots
- Encode attributes/properties of a frame
  - Integer, real number, string
- Represent relationships between frames
  - The value of a slot is the identifier of another frame
- Every slot is described by a “slot frame” in a KB that defines meta information about that slot

50 Slot Links
[Diagram: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; complex Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; pathway TCA Cycle; slot links labeled product, component-of, catalyzes, reaction, in-pathway]

51 Slots
- Number of values
  - Single valued
  - Multivalued: sets, bags
- Slot values
  - Any LISP object: Integer, real, string, symbol (frame name), list
- Slotunits define properties of slots: datatypes, classes, constraints
- Two slots are inverses if they encode opposite relationships
  - Slot Product in class Genes
  - Slot Gene in class Polypeptides
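The idea of inverse slots can be illustrated with a toy knowledge base. This is a Python stand-in for the concept only, not the Lisp frame system behind Pathway Tools; the class and method names are hypothetical, and multivalued slots are given set semantics (no duplicate values).

```python
class ToyKB:
    """Frames are dicts of slot name -> list of values; inverses are kept in sync."""

    def __init__(self):
        self.frames = {}     # frame name -> {slot name: [values]}
        self.inverses = {}   # slot name -> inverse slot name

    def create_frame(self, name):
        if name in self.frames:
            raise ValueError(f"frame {name!r} already exists")
        self.frames[name] = {}

    def declare_inverse(self, slot, inverse):
        self.inverses[slot] = inverse
        self.inverses[inverse] = slot

    def add_value(self, frame, slot, value):
        values = self.frames[frame].setdefault(slot, [])
        if value not in values:
            values.append(value)
        inverse = self.inverses.get(slot)
        # Maintain the inverse link only when the value names another frame.
        if inverse is not None and value in self.frames:
            back = self.frames[value].setdefault(inverse, [])
            if frame not in back:
                back.append(frame)

    def get_values(self, frame, slot):
        return list(self.frames[frame].get(slot, []))


kb = ToyKB()
kb.create_frame("sdhA")        # a gene
kb.create_frame("Sdh-flavo")   # its polypeptide product
kb.declare_inverse("Product", "Gene")
kb.add_value("sdhA", "Product", "Sdh-flavo")
```

After the single `add_value` call, `kb.get_values("Sdh-flavo", "Gene")` returns `["sdhA"]`, because the inverse slot was filled automatically.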

52 Representation of Function
[Diagram: genes sdhA, sdhB, sdhC, sdhD; polypeptides Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1, Sdh-membrane-2; complex Succinate dehydrogenase; Enzymatic-reaction; reaction Succinate + FAD = fumarate + FADH2; pathway TCA Cycle; attributes shown: EC#, Keq, Cofactors, Inhibitors, Molecular wt, pI, Left-end-position]

53 Monofunctional Monomer
[Diagram labels: Gene; Monomer; Enzymatic-reaction; Reaction; Pathway]

54 Bifunctional Monomer
[Diagram labels: Gene; Monomer; Enzymatic-reaction (two); Reaction (two); Pathway]

55 Monofunctional Multimer
[Diagram labels: Gene; Monomer; Multimer; Enzymatic-reaction; Reaction; Pathway]

56 Pathway and Substrates
[Diagram labels: Pathway; Reaction (two); Reactant-1, Reactant-2; Product-1, Product-2; slot links in-pathway, left, right]

57 Transcriptional Regulation
[Diagram labels: site001; pro001; trpL, trpE, trpD, trpC, trpB, trpA; trpLEDCBA; RpoSig70; TrpR*trp; apoTrpR; trp; Int001; Int003; Int005]

58 Annotations
- Encode information about individual slot values
- Used to attach comments and citations to slot values
- Example:
  - Frame tryptophan-synthetase has a slot called Molecular-Weight with a value of 28
  - Attached to that value is an annotation whose label is Citation and whose value is “[3444332]”
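The example on this slide can be mirrored with a small data structure in which each slot value carries its own labeled annotations. The names `SlotValue` and `get_annotation` are hypothetical; this shows the shape of the data, not how Pathway Tools stores it.

```python
from dataclasses import dataclass, field


@dataclass
class SlotValue:
    value: object
    annotations: dict = field(default_factory=dict)  # label -> annotation value


# Frame tryptophan-synthetase, slot Molecular-Weight, value 28, with a citation.
tryptophan_synthetase = {
    "Molecular-Weight": [SlotValue(28, {"Citation": "[3444332]"})],
}


def get_annotation(frame, slot, value, label):
    """Return the annotation with the given label on one slot value, or None."""
    for slot_value in frame.get(slot, []):
        if slot_value.value == value:
            return slot_value.annotations.get(label)
    return None
```

The annotation hangs off the individual value, not the slot, so a second Molecular-Weight value could carry a different citation.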

59 Facets
- Encode information about slots
- Allow association between a slot and:
  - comments
  - citations
- Example: Comment attached to Inhibitors of EnzRxn
- Allow access to schema information

60 Principal Classes
- Class names are capitalized, plural
- Genetic-Elements, with subclasses:
  - Chromosomes
  - Plasmids
- Genes
- Transcription-Units
- RNAs
- Proteins, with subclasses:
  - Polypeptides
  - Protein-Complexes

61 Principal Classes
- Reactions, with subclasses:
  - Transport-Reactions
- Enzymatic-Reactions
- Pathways
- Compounds-And-Elements

62 Slots in Multiple Classes
- Common-Name
- Synonyms
- Names (computed as union of Common-Name, Synonyms)
- Comment
- Citations
- DB-Links

63 Genes Slots
- Chromosome
- Left-End-Position
- Right-End-Position
- Centisome-Position
- Transcription-Direction
- Product

64 Proteins Slots
- Molecular-Weight-Seq
- Molecular-Weight-Exp
- pI
- Locations
- Modified-Form
- Unmodified-Form
- Component-Of

65 Polypeptides Slots
- Gene

66 Protein-Complexes Slots
- Components

67 Reactions Slots
- EC-Number
- Left, Right
- Substrates (computed as union of Left, Right)
- DeltaG0
- Keq
- Spontaneous?
- Species

68 Enzymatic-Reactions Slots
- Enzyme
- Reaction
- Activators
- Inhibitors
- Physiologically-Relevant
- Cofactors
- Prosthetic-Groups
- Alternative-Substrates
- Alternative-Cofactors

69 Editing Pathway/Genome Databases

70 Pathway Tools Paradigm
- Separate database from user interface
- Navigator provides one view of the DB
- Editors provide an alternative view of the DB

71 Invoking the Editors
- Right-Click on an Object Handle
  - Edit
  - Notes
  - Show
- Shift-Middle-Click on an Object Handle

72 Saving Changes
- The user must save changes explicitly with Save KB
- To discard changes made since last save
  - Special -> KB -> Revert KB

73 Administering the Pathway Tools

74 Information Sources
- Pathway Tools User’s Guide
  - aic-export/ecocyc/genopath/released/doc/userguide1.pdf
    - Appendix A: Guide to the Pathway Tools Schema
  - aic-export/ecocyc/genopath/released/doc/userguide2.pdf
- Pathway Tools Web Site
  - http://bioinformatics.ai.sri.com/ptools/
- Pathway Tools Tutorial
  - http://bioinformatics.ai.sri.com/ptools/tutorial/

75 Reporting Problems
- E-mail to ptools-support@ai.sri.com
- Include:
  - Error message
  - Result of :zoom :count :all
  - What version and platform you are running
  - What operation you were performing when the error occurred

