Computing with Pathway/Genome Databases

1 Computing with Pathway/Genome Databases

2 Approx. presentation time: 1.5 hrs

3 Overview
Summary of Pathway Tools data access mechanisms and formats
Pathway Tools APIs
Overview of Pathway Tools schema

4 Motivations for Understanding the Schema
When writing complex queries to PGDBs, those queries must refer to classes and slots within the schema
  Queries using Lisp, Perl, Java APIs
  Queries using Structured Advanced Query Form
  Queries using BioVelo
Example: find the products of all genes longer than 1,000 nucleotides
  (loop for g in (get-class-all-instances '|Genes|)
        when (< 1000 (abs (- (get-slot-value g 'left-end-position)
                             (get-slot-value g 'right-end-position))))
        collect (get-slot-value g 'product))

5 Pathway Tools Implementation Details
Platforms: Macintosh, PC/Linux, and PC/Windows platforms Same binary can run as desktop app or Web server Production-quality software Version control Two regular releases per year Extensive quality assurance Extensive documentation Auto-patch Automatic DB-upgrade 420,000 lines of Lisp code

6 More Information Pathway Tools Web Site, Tutorial Slides
PTools APIs: Web services: Guide to the Pathway Tools Schema Curator's Guide

7 References Ontology Papers section of "An Evidence Ontology for use in Pathway/Genome Databases" "An ontology for biological function based on molecular interactions" "Representations of metabolic knowledge: Pathways" "Representations of metabolic knowledge"

8 Data Exchange APIs: Lisp API, Java API, and Perl API
Read and modify access Web services Cyclone Export to files BioPAX Export Export PGDB genome to Genbank format Export entire PGDB as column-delimited and attribute-value file formats Export PGDB reactions as SBML -- Import/Export of Pathways: between PGDBs Import/Export of Selected Frames, for Spreadsheets Import/Export of Compounds as Molfile, CML BioWarehouse : Loader for Flatfiles, SQL access BMC Bioinformatics 7:

9 Pathway Tools Ontology / Schema
Ontology classes: 1621 Datatype classes: Define objects from genomes to pathways Classification systems for pathways, chemical compounds, enzymatic reactions (EC system) Protein Feature ontology Controlled vocabularies: Cell Component Ontology Evidence codes Comprehensive set of 279 attributes and relationships

10 High-Level Classes in the Pathway Tools Ontology
Chemicals -- All molecules
Polymer-Segments -- Regions of polymers
Protein-Features -- Features on proteins
Organisms
Reactions -- Biochemical reactions
Enzymatic-Reactions -- Link enzymes to reactions they catalyze
Pathways -- Metabolic and signaling pathways
Regulation -- Regulatory interactions
CCO -- Cell Component Ontology
Evidence -- Evidence ontology
Gene-Ontology-Terms -- GO
Growth-Observations -- Observations of growth of organism
Notes -- Timestamped, person-stamped notes
Organizations, People
Publications

11 Navigating the Schema

12 Use GKB Editor to Inspect the Pathway Tools Ontology
GKB Editor = Generic Knowledge Base Editor
Type in Navigator window: (GKB), or [Right-Click] Edit->Ontology Editor
View->Browse Class Hierarchy
[Middle-Click] to expand hierarchy
To view classes or instances, select them and:
  Frame -> List Frame Contents
  Frame -> Edit Frame

13 Use the SAQP to Inspect the Schema

14 Pathway Tools Schema Guide to the Pathway Tools Schema
Schema overview diagram

15 Principal Classes
Class names are capitalized, plural, separated by dashes
Genetic-Elements, with subclasses:
  Chromosomes
  Plasmids
Genes
Transcription-Units
RNAs
  rRNAs, snRNAs, tRNAs, Charged-tRNAs
Proteins, with subclasses:
  Polypeptides
  Protein-Complexes

16 Principal Classes
Reactions, with subclasses:
  Transport-Reactions
Enzymatic-Reactions
Pathways
Compounds-And-Elements

17 Principal Classes Regulation

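18 Slot Links
[Diagram: the pathway TCA Cycle, the reaction succinate + FAD = fumarate + FADH2, its enzymatic-reaction frame, the enzyme succinate dehydrogenase, the subunits Sdh-flavo, Sdh-Fe-S, Sdh-membrane-1 and Sdh-membrane-2, and the genes sdhA, sdhB, sdhC and sdhD, connected by the slots in-pathway, reaction, catalyzes, component-of and product.]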

19 Programmatic Access to BioCyc
Common Lisp
  Native language of Pathway Tools
  Interactive and mature environment
  Full access to the data and many utility functions
  Source code is available for academics
PerlCyc
  API of functions, exposed to Perl
  Communication through Unix socket
JavaCyc
  API of functions, exposed to Java
Cyclone

20 Cyclone Developed by Schachter and colleagues from Genoscope
Cyclone is a Java-based system that: Extracts data from a Pathway Tools PGDB Converts it to an XML schema Maps the data to Java objects and to a relational database Changes made to the data on the Java side can be committed back to a Pathway Tools PGDB

21 Lisp API
Accessible whenever you start Pathway Tools with the -lisp argument
Lisp queries evaluate against the running Pathway Tools binary and execute very fast

22 Ocelot Object Database

23 Pathway Tools Implementation Details
Platforms: Macintosh, PC/Linux, and PC/Windows platforms Same binary can run as desktop app or Web server Production-quality software Version control Two regular releases per year Extensive quality assurance Extensive documentation Auto-patch Automatic DB-upgrade 600,000 lines of Lisp code

24 Pathway Tools Architecture
[Architecture diagram: Genome Navigator (Web mode and desktop mode); Protein Editor, Pathway Editor and Reaction Editor; Lisp, Perl and Java APIs; GFP API; Ocelot DBMS; storage in Oracle or MySQL, or in a disk file.]

25 Ocelot Object Database
Frame data model
  Classes, instances, inheritance
  Frames have slots that define their properties, attributes, relationships
  A slot has one or more values
  Datatypes include numbers, strings, etc.
Slotunit frames define metadata about slots:
  Domain, range, inverse
  Collection type, number of values, value constraints
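The frame model above can be illustrated with a toy in-memory store. This is only a sketch under stated assumptions: the `Frame` and `FrameKB` classes are hypothetical and are not the Ocelot implementation; only the two method names mirror GFP calls used elsewhere in these slides. Slotunit metadata is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """A frame: a named object whose slots each hold a list of values."""
    name: str
    is_class: bool = False
    parents: list = field(default_factory=list)  # names of direct parent classes
    slots: dict = field(default_factory=dict)    # slot name -> list of values


class FrameKB:
    """Hypothetical toy knowledge base, for illustration only."""

    def __init__(self):
        self.frames = {}

    def add(self, frame):
        self.frames[frame.name] = frame
        return frame

    def ancestors(self, name):
        """All classes reachable by following parent links upward."""
        seen, stack = set(), list(self.frames[name].parents)
        while stack:
            parent = stack.pop()
            if parent not in seen:
                seen.add(parent)
                stack.extend(self.frames[parent].parents)
        return seen

    def get_slot_values(self, name, slot):
        """All values of a slot, or an empty list if the slot is unset."""
        return list(self.frames[name].slots.get(slot, []))

    def get_class_all_instances(self, cls):
        """Direct and indirect instances of cls, sorted by name."""
        return sorted(f.name for f in self.frames.values()
                      if not f.is_class and cls in self.ancestors(f.name))


kb = FrameKB()
kb.add(Frame("Proteins", is_class=True))
kb.add(Frame("Polypeptides", is_class=True, parents=["Proteins"]))
kb.add(Frame("Protein-Complexes", is_class=True, parents=["Proteins"]))
kb.add(Frame("DnaE", parents=["Polypeptides"],
             slots={"component-of": ["pol-III-core"]}))
kb.add(Frame("pol-III-core", parents=["Protein-Complexes"],
             slots={"components": ["DnaE"]}))

print(kb.get_class_all_instances("Proteins"))
print(kb.get_slot_values("DnaE", "component-of"))
```

Querying the parent class Proteins returns instances of both subclasses, which is the inheritance behavior the slide describes.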

26 Storage System Architecture
File KBs Read-only applications can be distributed without a relational DBMS Load all objects and code into Lisp memory Dump virtual memory to binary executable file

27 Ocelot Storage System Architecture
Persistent storage via disk files, MySQL or Oracle DBMS
  Concurrent development: MySQL or Oracle
  Single-user development: disk files
Relational DBMS storage
  RDBMS is submerged within Ocelot, invisible to users
  Frames transferred from RDBMS to Ocelot
    On demand
    By background prefetcher
  Memory cache
  Persistent disk cache to speed performance via Internet

28 Transaction Logging
Relational DBMS stores:
  The latest version of each Ocelot frame
  A log of all GFP operations applied to KB
Transaction log enables:
  Reconstruction of earlier versions of KB
  View history of changes to an object
  Update replicates of a KB
  Detection of update conflicts during concurrency control
  Undo of updates
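A minimal sketch of how an operation log can support version reconstruction, per-object history and undo. The `LoggedKB` class and its two operation names ("add", "remove") are assumptions made for illustration; the real log records GFP operations and lives in the relational DBMS.

```python
class LoggedKB:
    """Hypothetical KB that keeps the latest state plus a log of operations."""

    def __init__(self):
        self.frames = {}  # frame -> slot -> list of values (latest version)
        self.log = []     # ordered (operation, frame, slot, value) tuples

    @staticmethod
    def _apply(state, op, frame, slot, value):
        values = state.setdefault(frame, {}).setdefault(slot, [])
        if op == "add":
            values.append(value)
        elif op == "remove":
            values.remove(value)
        else:
            raise ValueError(f"unknown operation: {op}")

    def execute(self, op, frame, slot, value):
        # Apply first, so an operation that fails is never logged.
        self._apply(self.frames, op, frame, slot, value)
        self.log.append((op, frame, slot, value))

    def version(self, n):
        """Rebuild the KB as it was after the first n logged operations."""
        state = {}
        for entry in self.log[:n]:
            self._apply(state, *entry)
        return state

    def history(self, frame):
        """All logged operations that touched one frame."""
        return [entry for entry in self.log if entry[1] == frame]

    def undo(self):
        """Drop the last operation and restore the previous version."""
        self.log.pop()
        self.frames = self.version(len(self.log))


log_kb = LoggedKB()
log_kb.execute("add", "TRP", "SYNONYMS", "W")
log_kb.execute("add", "TRP", "SYNONYMS", "trp")
log_kb.execute("remove", "TRP", "SYNONYMS", "W")
print(log_kb.frames["TRP"]["SYNONYMS"])      # latest version
print(log_kb.version(2)["TRP"]["SYNONYMS"])  # earlier version, rebuilt from the log
```

Replaying the log from the start is the simplest design; a production system would more likely keep checkpoints so that old versions need not be rebuilt from the first operation.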

29 Optimistic Concurrency Control
Locking approach: edits to one object can require locking all connected objects
Optimistic approach: no locking
  User performs updates in local workspace
  When user commits changes, storage system compares user changes against all other committed changes
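The commit-time check can be sketched as follows. This is an illustration only: it assumes a per-slot version counter as the means of detecting conflicting commits, which is one common technique and not necessarily how Ocelot compares changes. `Store`, `Workspace` and `ConflictError` are hypothetical names.

```python
class ConflictError(Exception):
    """Raised when another user committed to a slot this workspace touched."""


class Store:
    def __init__(self):
        self.values = {}    # (frame, slot) -> committed value
        self.versions = {}  # (frame, slot) -> number of commits so far

    def begin(self):
        return Workspace(self)

    def commit(self, workspace):
        # Validate everything before writing anything.
        for key, seen in workspace.seen_versions.items():
            if self.versions.get(key, 0) != seen:
                raise ConflictError(key)
        for key, value in workspace.writes.items():
            self.values[key] = value
            self.versions[key] = self.versions.get(key, 0) + 1


class Workspace:
    """Local, lock-free scratch area for one user's edits."""

    def __init__(self, store):
        self.store = store
        self.seen_versions = {}  # version of each slot when first touched
        self.writes = {}

    def _touch(self, key):
        self.seen_versions.setdefault(key, self.store.versions.get(key, 0))

    def get(self, frame, slot):
        key = (frame, slot)
        if key in self.writes:
            return self.writes[key]
        self._touch(key)
        return self.store.values.get(key)

    def put(self, frame, slot, value):
        key = (frame, slot)
        self._touch(key)
        self.writes[key] = value


store = Store()
alice, bob = store.begin(), store.begin()
alice.put("TRP", "COMMON-NAME", "L-tryptophan")
bob.put("TRP", "COMMON-NAME", "tryptophan")
store.commit(alice)  # first committer succeeds
try:
    store.commit(bob)  # second committer touched the same slot
    bob_conflicted = False
except ConflictError:
    bob_conflicted = True
print(bob_conflicted)
```

Neither user ever waits on a lock; the cost is that the second committer must resolve the conflict and retry.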

30 Ocelot Knowledge Server Schema Evolution
FRSs store and process class and instance information similarly Application can query schema information as easily as it can query instances Schema is stored within the DB Schema is self documenting Schema evolution facilitated by Easy addition/removal of slots, or alteration of slot datatypes Flexible data formats that do not require dumping/reloading of data

31 Generic Frame Protocol (GFP)
A library of procedures for accessing Ocelot DBs GFP specification: A small number of GFP functions are sufficient for most complex queries

32 Example of a Single GFP Call
The general pattern:
  gfp-function(frame slot value ...)
In Lisp:
  (gfp-function frame slot value ...)
Example:
  (get-slot-values 'TRYPSYN-RXN 'LEFT)
  ==> (INDOLE-3-GLYCEROL-P SER)

33 Frame References At the GFP level, every Ocelot frame can be referred to using either symbol frame name or frame object Most GFP functions return frame objects Importance of using fequal for comparisons

34 Generic Frame Protocol
get-class-all-instances (Class)
  Returns direct and indirect instances of Class
coercible-to-frame-p (Thing)
  Is Thing a frame? Returns True if Thing is the name of a frame, or a frame object; else False

35 Generic Frame Protocol
Notation: Frame.Slot means a specified slot of a specified frame. Note: Slot must be a symbol!
get-slot-value(Frame Slot)
  Returns first value of Frame.Slot
get-slot-values(Frame Slot)
  Returns all values of Frame.Slot as a list
slot-has-value-p(Frame Slot)
  Returns True if Frame.Slot has at least one value; else False
member-slot-value-p(Frame Slot Value)
  Returns True if Value is one of the values of Frame.Slot; else False
instance-all-instance-of-p(Instance Class)
  Returns True if Instance is an all-instance of Class

36 Generic Frame Protocol
print-frame(Frame) Prints the contents of Frame

37 Generic Frame Protocol – Update Operations
put-slot-value(Frame Slot Value)
  Replace the current value(s) of Frame.Slot with Value
put-slot-values(Frame Slot Value-List)
  Replace the current value(s) of Frame.Slot with Value-List, which must be a list of values
add-slot-value(Frame Slot Value)
  Add Value to the current value(s) of Frame.Slot, if any
remove-slot-value(Frame Slot Value)
  Remove Value from the current value(s) of Frame.Slot
replace-slot-value(Frame Slot Old-Value New-Value)
  In Frame.Slot, replace Old-Value with New-Value
remove-local-slot-values(Frame Slot)
  Remove all of the values of Frame.Slot
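The effect of each update operation can be traced on a toy slot store. This sketch only mirrors the descriptions on the slide; `SlotStore` is a hypothetical class, and details such as whether the real add-slot-value rejects duplicate values are not modeled.

```python
class SlotStore:
    """Hypothetical stand-in for a KB: maps (frame, slot) to a list of values."""

    def __init__(self):
        self.slots = {}

    def get_slot_values(self, frame, slot):
        return list(self.slots.get((frame, slot), []))

    def put_slot_value(self, frame, slot, value):
        # Replace whatever was there with a single value.
        self.slots[(frame, slot)] = [value]

    def put_slot_values(self, frame, slot, values):
        # Replace whatever was there with a list of values.
        self.slots[(frame, slot)] = list(values)

    def add_slot_value(self, frame, slot, value):
        # Assumption: appends without checking for duplicates.
        self.slots.setdefault((frame, slot), []).append(value)

    def remove_slot_value(self, frame, slot, value):
        current = self.slots.get((frame, slot), [])
        self.slots[(frame, slot)] = [v for v in current if v != value]

    def replace_slot_value(self, frame, slot, old, new):
        current = self.slots.get((frame, slot), [])
        self.slots[(frame, slot)] = [new if v == old else v for v in current]

    def remove_local_slot_values(self, frame, slot):
        self.slots.pop((frame, slot), None)


slot_kb = SlotStore()
slot_kb.put_slot_values("TRP", "SYNONYMS", ["W", "trp"])
slot_kb.add_slot_value("TRP", "SYNONYMS", "tryptophan")
slot_kb.replace_slot_value("TRP", "SYNONYMS", "trp", "Trp")
slot_kb.remove_slot_value("TRP", "SYNONYMS", "W")
print(slot_kb.get_slot_values("TRP", "SYNONYMS"))
```

Note the difference between put-slot-value, which discards all existing values, and add-slot-value, which keeps them.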

38 Generic Frame Protocol – Update Operations
save-kb Saves the current KB

39 Additional Pathway Tools Functions – Semantic Inference Layer
Semantic inference layer defines built-in functions to compute commonly required relationships in a PGDB

40 PerlCyc and JavaCyc
Work on Unix (Solaris or Linux) only
Start up Pathway Tools with the -api argument
Pathway Tools listens on a Unix socket; the Perl or Java program communicates through this socket
Supports both querying and editing PGDBs
Must run the Perl or Java program on the same machine that runs Pathway Tools
  This is a security measure, as the API server has no built-in security
Can only handle one connection at a time

41 Obtaining PerlCyc and JavaCyc
Download from PerlCyc written and maintained by Lukas Mueller at Boyce Thompson Institute for Plant Research. JavaCyc written by Thomas Yan at Carnegie Institute, maintained by Lukas Mueller. Easy to extend…

42 Examples of PerlCyc, JavaCyc Functions
GFP functions (require knowledge of Pathway Tools schema):
  PerlCyc: get_slot_values, get_class_all_instances, put_slot_values
  JavaCyc: getSlotValues, getClassAllInstances, putSlotValues
Pathway Tools functions (described at …):
  PerlCyc: genes_of_reaction, find_indexed_frame, pathways_of_gene, transport_p
  JavaCyc: genesOfReaction, findIndexedFrame, pathwaysOfGene, transportP

43 Writing a PerlCyc or JavaCyc program
Create a PerlCyc or JavaCyc object:
  Perl: perlcyc -> new ("ORGID")
  Java: new Javacyc ("ORGID")
Call PerlCyc or JavaCyc functions on this object:
  Perl:
    my $cyc = perlcyc -> new ("ECOLI");
    my @pathways = $cyc -> all_pathways ();
  Java:
    Javacyc cyc = new Javacyc("ECOLI");
    ArrayList pathways = cyc.allPathways ();
Functions return object IDs, not objects. Must connect to server again to retrieve attributes of an object.
  Perl:
    foreach my $p (@pathways) {
      print $cyc -> get_slot_value ($p, "COMMON-NAME");
    }
  Java:
    for (int i = 0; i < pathways.size(); i++) {
      String pwy = (String) pathways.get(i);
      System.out.println (cyc.getSlotValue (pwy, "COMMON-NAME"));
    }

44 Sample PerlCyc Query
Number of proteins in E. coli
  use perlcyc;
  my $cyc = perlcyc -> new ("ECOLI");
  my @proteins = $cyc->get_class_all_instances("|Proteins|");
  my $protein_count = scalar(@proteins);
  print "Protein count: $protein_count.\n";

45 Sample PerlCyc Query
Print IDs of all proteins with molecular weight between 10 and 20 kD and pI between 4 and 5.
  use perlcyc;
  my $cyc = perlcyc -> new ("ECOLI");
  foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
    my $mw = $cyc->get_slot_value($p, "molecular-weight-kd");
    my $pI = $cyc->get_slot_value($p, "pi");
    if ($mw <= 20 && $mw >= 10 && $pI <= 5 && $pI >= 4) {
      print "$p\n";
    }
  }

46 Sample PerlCyc Query
List all the transcription factors in E. coli, and the list of genes that each regulates:
  use perlcyc;
  my $cyc = perlcyc -> new ("ECOLI");
  foreach my $p ($cyc->get_class_all_instances("|Proteins|")) {
    if ($cyc->transcription_factor_p($p)) {
      my $name = $cyc->get_slot_value($p, "common-name");
      my %genes = ();
      foreach my $tu ($cyc->regulon_of_protein($p)) {
        foreach my $g ($cyc->transcription_unit_genes($tu)) {
          $genes{$g} = $cyc->get_slot_value($g, "common-name");
        }
      }
      print "\n\n$name: ";
      print join " ", values %genes;
    }
  }

47 Sample Editing Using PerlCyc
Add a link from each gene to the corresponding object in MY-DB (assume ID is same in both cases)
  use perlcyc;
  my $cyc = perlcyc -> new ("HPY");
  my @genes = $cyc->get_class_all_instances ("|Genes|");
  foreach my $g (@genes) {
    $cyc->add_slot_value ($g, "DBLINKS", "(MY-DB \"$g\")");
  }
  $cyc->save_kb();

48 Sample JavaCyc Query: Enzymes for which ATP is a regulator
import java.util.*;

public class JavacycSample {
    public static void main(String[] args) {
        Javacyc cyc = new Javacyc("ECOLI");
        ArrayList regframes = cyc.getClassAllInstances("|Regulation-of-Enzyme-Activity|");
        for (int i = 0; i < regframes.size(); i++) {
            String reg = (String) regframes.get(i);
            // Does the Regulator slot contain the compound ATP?
            boolean bool = cyc.memberSlotValueP(reg, "Regulator", "ATP");
            if (bool) {
                String enzrxn = cyc.getSlotValue(reg, "Regulated-Entity");
                String enzyme = cyc.getSlotValue(enzrxn, "Enzyme");
                System.out.println(enzyme);
            }
        }
    }
}

49 Simple Lisp Query Example: Enzymes for which ATP is a regulator
(defun atp-inhibits ()
  (loop for x in (get-class-all-instances '|Regulation-of-Enzyme-Activity|)
        ;; Does the Regulator slot contain the compound ATP, and is the mode
        ;; of regulation negative (inhibition)?
        when (and (member-slot-value-p x 'Regulator 'ATP)
                  (member-slot-value-p x 'Mode "-"))
        ;; Whenever the test is positive, we collect the value of the slot Enzyme
        ;; of the Regulated-Entity of the regulatory interaction frame.
        ;; The collected values are returned as a list, once the loop terminates.
        collect (get-slot-value (get-slot-value x 'Regulated-Entity) 'Enzyme)))

;;; invoking the query:
(select-organism :org-id 'ECOLI)
(atp-inhibits)

(get-slot-values 'TRYPSYN-RXN 'LEFT)
==> (INDOLE-3-GLYCEROL-P SER)

50 Simple Perl Query Example: Enzymes for which ATP is a regulator
use perlcyc;
my $cyc = perlcyc -> new("ECOLI");
my @regs = $cyc -> get_class_all_instances("|Regulation-of-Enzyme-Activity|");
## We check every instance of the class
foreach my $reg (@regs) {
  ## We test whether the Regulator slot contains the compound frame ATP
  ## and whether the mode of regulation is negative (inhibition)
  my $bool1 = $cyc -> member_slot_value_p($reg, "Regulator", "ATP");
  my $bool2 = $cyc -> member_slot_value_p($reg, "Mode", "-");
  if ($bool1 && $bool2) {
    ## Whenever the test is positive, we collect the value of the slot Enzyme.
    ## The results are printed in the terminal.
    my $enzrxn = $cyc -> get_slot_value($reg, "Regulated-Entity");
    my $enz = $cyc -> get_slot_value($enzrxn, "Enzyme");
    print STDOUT "$enz\n";
  }
}

51 Getting started with Lisp
pathway-tools -lisp
(load "file")
(compile-file "file.lisp")
Emacs is a useful editor
Pathway Tools source code is available: ask
Overview of Lisp information resources:
Documented Pathway Tools Lisp functions:

52 Viewing Results via the Answer List
(loop for r in (get-class-all-instances '|Reactions|)
      when (< 3 (length (get-slot-values r 'left)))
      collect r)
(setq answer *)
(object-table answer)
(replace-answer-list answer)
(pt)
Next Answer

53 Query Gotchas
Study schema carefully
:test #'fequal
Cascade of slot-values: check for NIL
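The NIL gotcha arises when one slot lookup feeds the next: if the first slot is empty, the second lookup is made on nothing. A small sketch of the guard, using a plain dictionary as a stand-in KB. The frame names REG-1, REG-2, ENZRXN-1 and PFK are made up for illustration, and `None` plays the role of NIL.

```python
def get_slot_value(kb, frame, slot):
    """First value of frame.slot, or None if the frame or slot has no value."""
    if frame is None:
        return None
    values = kb.get(frame, {}).get(slot, [])
    return values[0] if values else None


def enzyme_of_regulation(kb, reg):
    """Follow Regulated-Entity, then Enzyme, stopping early on an empty slot."""
    enzrxn = get_slot_value(kb, reg, "Regulated-Entity")
    if enzrxn is None:  # the check the slide asks for
        return None
    return get_slot_value(kb, enzrxn, "Enzyme")


toy_kb = {
    "REG-1": {"Regulated-Entity": ["ENZRXN-1"]},
    "ENZRXN-1": {"Enzyme": ["PFK"]},
    "REG-2": {},  # regulation frame with no regulated entity recorded
}

print(enzyme_of_regulation(toy_kb, "REG-1"))
print(enzyme_of_regulation(toy_kb, "REG-2"))
```

In a collecting loop, the same guard keeps empty results out of the answer list.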

54 Semantic Inference Layer relationships.lisp
Library of functions that encapsulate common query building blocks and intricacies of navigating the schema
  enzymes-of-gene
  reactions-of-gene
  pathways-of-gene
  genes-of-pathway
  pathway-hole-p
  reactions-of-compound
  top-containers(protein)
  all-rxns(type)
    Types: :metab-smm :metab-all :metab-pathways :enzyme :transport etc.
    Example: (all-rxns :metab-pathways)

55 Pathway Tools Schema and Semantic Inference Layer Genes, Operons, and Replicons

56 Representing a Genome
[Diagram: the organism ORG, its replicons CHROM1, CHROM2 and PLASMID1, the genes Gene1, Gene2 and Gene3, and the gene product Product1, connected by the genome and components slots.]
Classes:
  ORG is of class Organisms
  CHROM1 is of class Chromosomes
  PLASMID1 is of class Plasmids
  Gene1 is of class Genes
  Product1 is of class Polypeptides or RNA

57 Polynucleotides
Review slots of COLI and of COLI-K12

58 Genetic-Elements Sequence is stored in a separate file or database table

59 Polymer-Segments Review slots of Genes

60 Complexities of Gene / Gene-Product Relationships
The Product of a gene can be an instance of Polypeptides or RNAs
An instance of Polypeptides can have more than one gene encoding it
Sequence position:
  Nucleotide positions of starting and ending codons specified in Left-End-Position and Right-End-Position (usually greater, except at origin)
  Transcription-Direction: + / -
Alternative splicing:
  Nucleotide positions of starting and ending codons specified in Left-End-Position and Right-End-Position
  Intron positions specified in Splice-Form-Introns of gene product ( ) ( )
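Queries on gene or protein length work from these two position slots. The arithmetic can be sketched as below, under loudly stated assumptions that the slides do not confirm: coordinates are inclusive, the coding region is contiguous (no introns), and the span includes one stop codon. Both function names are hypothetical.

```python
def gene_length_nt(left_end, right_end):
    """Span of a gene in nucleotides, assuming inclusive end positions."""
    return abs(right_end - left_end) + 1


def approx_protein_length_aa(left_end, right_end):
    """Rough amino-acid count: codons in the span minus one stop codon.

    Assumes no introns; a spliced gene would need its intron lengths
    (see Splice-Form-Introns) subtracted first.
    """
    codons = gene_length_nt(left_end, right_end) // 3
    return max(codons - 1, 0)


print(gene_length_nt(190, 255))
print(approx_protein_length_aa(190, 255))
```

Taking the absolute difference makes the result independent of which end is larger. A threshold stated in amino acids corresponds to roughly three times as many nucleotides, so a query that compares the raw nucleotide span against 1,000 is selecting genes longer than about 1,000 nucleotides, not proteins longer than 1,000 amino acids.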

61 Gene Reaction Schematic

62 Exercises
Find all genes on a given chromosome
Find all ribosomal RNAs
Find the DNA sequence of a given gene
Find all proteins longer than 1,000 amino acids

63 Exercises Find all genes on a given chromosome
  (defun genes-of-chrom (chrom)
    (loop for x in (get-slot-values chrom 'components)
          when (instance-all-instance-of-p x '|Genes|)
          collect x))
Find all ribosomal RNAs
  (get-class-all-instances '|rRNAs|)
Find the DNA sequence of a given gene
  (get-gene-sequence gene)

64 Exercises
Find all monomers whose genes are longer than 1,000 nucleotides
  (loop for g in (get-class-all-instances '|Genes|)
        for p = (get-slot-value g 'product)
        when (and (< 1000 (abs (- (get-slot-value g 'left-end-position)
                                  (get-slot-value g 'right-end-position))))
                  (instance-all-instance-of-p p '|Polypeptides|))
        collect p)

65 Proteins

66 Proteins and Protein Complexes
Polypeptide: the monomer protein product of a gene (may have multiple isoforms, as indicated at gene level)
Protein complex: proteins consisting of multiple polypeptides or protein complexes
Example: DNA pol III
  DnaE is a polypeptide
  pol III core is DnaE and two other polypeptides
  pol III holoenzyme is several protein complexes combined

67 Protein Complex Relationships

68 Slots of a protein (DnaE)
catalyzes Is it an activator/reactant/etc? comments component-of dblinks features (edited in feature editor) Many other features possible

69 A complex at the frame level (pol III)
Same features as polypeptide frame, different use comment component-of and components note coefficients

70 Protein Complex Relationships

71 Relationships are Defined in Many Places
component-of comes from creating a complex appears-in-left-side-of comes from defining a reaction (as do modified forms) inhibitor-of comes from an enzymatic reaction can only edit dna-footprint if protein has been associated with a TU

72 Semantic Inference Layer
Reactions-of-protein (prot)
  Returns a list of rxns this protein catalyzes
Transcription-units-of-proteins (prot)
  Returns a list of TUs activated/inhibited by the given protein
Transporter? (prot)
  Is this protein a transporter?
Polypeptide-or-homomultimer? (prot)
Transcription-factor? (prot)
Obtain-protein-stats
  Returns 5 values
  Length of: all-polypeptides, complexes, transporters, enzymes, etc.

73 Example
Find all enzymes that use pyridoxal phosphate as a cofactor or prosthetic group
  (loop for protein in (get-class-all-instances '|Proteins|)
        for enzrxn = (get-slot-value protein 'enzymatic-reaction)
        when (and enzrxn
                  (or (member-slot-value-p enzrxn 'cofactors 'pyridoxal_phosphate)
                      (member-slot-value-p enzrxn 'prosthetic-groups 'pyridoxal_phosphate)))
        collect protein)
(member-slot-value-p frame slot value): T if Value is one of the values of Slot of Frame.

74 Example Queries Find all homomultimers
Find proteins whose pI > 10, and that reside on the negative strand of the first chromosome

75 Sample Find all proteins without a comment anywhere

76 Compounds / Reactions / Pathways

77 Compounds / Reactions / Pathways
Think of a three-tiered structure:
  Reactions built on top of compounds
  Pathways built on top of reactions
Metabolic network defined by reactions alone; pathways are an additional "optional" structure
Some reactions not part of a pathway
Some reactions have no attached enzyme
Some enzymes have no attached gene

78 Compounds


80 Compounds Relatively few aspects of a compound defined within the compound editor MW, formula calculated from edited structure Most aspects defined in other editors “Pathway reactions” comes from reaction editing followed by pathway editing Activator, etc come from the enzymatic reaction editor

81 Instance TRP
--- Instance TRP ---
Types: |Amino-Acid|, |Aromatic-Amino-Acids|, |Non-polar-amino-acids|
APPEARS-IN-LEFT-SIDE-OF: RXN0-287, TRANS-RXN-76, TRYPTOPHAN-RXN, TRYPTOPHAN--TRNA-LIGASE-RXN
APPEARS-IN-RIGHT-SIDE-OF: RXN0-2382, RXN0-301, TRANS-RXN-76, TRYPSYN-RXN
CHEMICAL-FORMULA: (C 11), (H 12), (N 2), (O 2)
COMMON-NAME: "L-tryptophan"
DBLINKS: (LIGAND-CPD "C00078" NIL |kaipa| NIL NIL), (CAS " "), (CAS " ")
NAMES: "L-tryptophan", "W", "tryptacin", "trofan", "trp", "tryptophan", "2-amino-3-indolylpropanic acid"
SMILES: "c1(c(CC(N)C(=O)O)c2(c([nH]1)cccc2))"
SYNONYMS: "W", "tryptacin", "trofan", "trp", "tryptophan",

82 Where is diphosphate in the ontology?

83 Semantic Inference Layer
Reactions-of-compound (cpd)
Pathways-of-compound (cpd)
Is-substrate-an-autocatalytic-enzyme-p (cpd)
Activated/inhibited-by? (cpds slots)
  Returns a list of enzrxns for which a cpd in cpds is a modulator (example slots: activators-all, activators-allosteric)
All-substrates (rxns)
  All unique substrates specified in the given rxns
Has-structure-p (cpd)
Obtain-cpd-stats
  Returns two values: length of all-cpds, cpds with structures

84 Miscellaneous things….
History List
  Back/Forward and History buttons
  Default list is 50 items
Show frame
  (print-frame 'frame)


86 Queries with Multiple Answers
Navigator queries:
  Example: Substring search for "pyruvate"
  Selected list is placed on the Answer list
  Use "Next Answer" button to view each one of them
Lisp queries:
  (get-class-all-instances '|Compounds|)
  Example: Find reactions involving pyruvate as a substrate
  (loop for rxn in (get-class-all-instances '|Reactions|)
        when (member 'pyruvate (get-slot-values rxn 'substrates) :test #'fequal)
        collect rxn)
  (replace-answer-list *)
