PHAR 201 Lecture 4, 2012: Data Representation and the Role of Ontologies. PHAR 201/Bioinformatics I. Philip E. Bourne, Department of Pharmacology, UCSD.


1 PHAR 201 Lecture 4, 2012. Data Representation and the Role of Ontologies. PHAR 201/Bioinformatics I. Philip E. Bourne, Department of Pharmacology, UCSD. Prerequisite reading: Genome Research (2001) 11:1425-1433

2 Consider this Course a Workflow in How You Will Handle Data (Regardless of Type) for the Rest of Your Lives. We use macromolecular structure data to illustrate the process, and hence learn structural bioinformatics along the way. The workflow: data in; understand the scope and complexity of the data; understand the experiment to understand the errors; understand how best to represent (model) the data; understand the methods to physically instantiate the model; recognize redundancy in the data; classify the data; analyze the data; discover new science from the data.

3 Agenda. Before there were ontologies there was mmCIF. Briefly review the history of ontology development. Review the Gene Ontology (GO): motivation; features; related research activities around GO.

4 The PDB Format. A full description is linked from the original slide. It was designed around an 80-column punched card. It was designed to be human readable. It is used by almost every piece of software that deals with structural data.

5 The PDB Format - Records. Every PDB file may be broken into a number of lines terminated by an end-of-line indicator. Each line in the PDB entry file consists of 80 columns. The last character in each PDB entry should be an end-of-line indicator. Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names. The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines. Each record type is further divided into fields.
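The record rules above can be sketched in a few lines of Python. This is a minimal illustration, not a full PDB reader: the helper names `record_name` and `group_by_record_type` and the three-line excerpt are hypothetical, and lines shorter than 80 columns are padded on the assumption that trailing blanks were stripped.

```python
from collections import defaultdict


def record_name(line: str) -> str:
    """Return the record name held in columns 1-6 (left-justified, blank-filled)."""
    return line[:6].rstrip()


def group_by_record_type(lines):
    """View a PDB entry as a collection of record types, each with one or more lines."""
    groups = defaultdict(list)
    for line in lines:
        # Pad to the nominal 80 columns in case trailing blanks were removed.
        padded = line.rstrip("\n").ljust(80)
        groups[record_name(padded)].append(padded)
    return dict(groups)


# Hypothetical excerpt, for illustration only.
example = [
    "HEADER    TOXIN",
    "REMARK   3 REFINEMENT.",
    "REMARK   4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH",
]
groups = group_by_record_type(example)
print(sorted(groups))
```

Because every line is self-identifying, the grouping needs no state beyond the first six columns of each line.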

6 The PDB Format – An Example – The Header

7 The PDB Format – An Example – The Atomic Coordinates

8 The Description – Atom Records
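Since the slide's field table is not reproduced here, the following sketch shows what reading an ATOM record by fixed columns might look like. The column positions follow the published PDB format description as I understand it (serial in columns 7-11, coordinates in 31-54, and so on) and should be checked against that document; the `Atom` class, the `parse_atom` helper, and the example line are hypothetical, with the values borrowed from the valine atom in the mmCIF extract later in the lecture.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    b_factor: float


def parse_atom(line: str) -> Atom:
    """Slice one ATOM/HETATM line by column position (0-based Python slices)."""
    if line[0:6] not in ("ATOM  ", "HETATM"):
        raise ValueError("not an ATOM or HETATM record")
    return Atom(
        serial=int(line[6:11]),         # columns 7-11
        name=line[12:16].strip(),       # columns 13-16
        res_name=line[17:20].strip(),   # columns 18-20
        chain_id=line[21],              # column 22
        res_seq=int(line[22:26]),       # columns 23-26
        x=float(line[30:38]),           # columns 31-38
        y=float(line[38:46]),           # columns 39-46
        z=float(line[46:54]),           # columns 47-54
        occupancy=float(line[54:60]),   # columns 55-60
        b_factor=float(line[60:66]),    # columns 61-66
    )


# Example line assembled field by field so the column widths stay visible.
EXAMPLE_LINE = (
    "ATOM  " "    1" " " " N  " " " "VAL" " " "A" "  11" " " "   "
    "  25.360" "  30.691" "  11.795" "  1.00" " 17.93"
    "          " " N" "  "
)
atom = parse_atom(EXAMPLE_LINE)
print(atom)
```

Note that the meaning of each slice lives only in the code and in a separate format document, and that a five-column serial number cannot hold a decimal value above 99999.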

9 What is Wrong with this Approach? The description and the data are separate. Parsing is a nightmare: the most complex piece of code we have in our research laboratory probably remains the PDB parser. There are no relationships between items of data. Some data just cannot be parsed. The fixed-column format cannot represent some of today's structures.

10 Structures are Spread Over Multiple Files. Most users are not aware of this.

11 PDB Format - Important Components of the Data are Lost to All But Humans

REMARK 3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK 3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK 3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK 3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK 3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK 3 ANGSTROMS.
REMARK 4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK 4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK 4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK 4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK 4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK 4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK 4 STRUCTURE (*2EBX*) WERE USED.

12 mmCIF Was Developed to Address these Problems. Methods in Enzymology (1997) 277, 571-590

13 mmCIF – Scope of the Initial Effort. All PDB data should be captured. Describe a paper's materials and methods section. Describe the biologically active molecule. Fully describe secondary structure, but not tertiary or quaternary. Describe details of chemistry (including 2D). Meaningful 3D views.

14 mmCIF - Extract from a Data File

loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N  VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C  VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
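To show how the item names travel with the data, here is a minimal sketch that turns a `loop_` block like the one above into one dictionary per row. It assumes every value is a plain whitespace-separated token, so quoted strings and semicolon-delimited text fields, which real mmCIF files contain, are not handled; `parse_loop` is a hypothetical helper, not part of any mmCIF library.

```python
def parse_loop(text: str):
    """Parse a simple mmCIF loop_ block into a list of {item name: value} rows."""
    tokens = text.split()
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected a loop_ block")

    # Item names come first; each starts with an underscore.
    names = []
    i = 1
    while i < len(tokens) and tokens[i].startswith("_"):
        names.append(tokens[i])
        i += 1

    # The remaining tokens are values, cycling through the names in order.
    values = tokens[i:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the item count")
    width = len(names)
    return [dict(zip(names, values[j:j + width]))
            for j in range(0, len(values), width)]


EXTRACT = """
loop_
_atom_site.group_PDB _atom_site.type_symbol _atom_site.label_atom_id
_atom_site.label_comp_id _atom_site.label_asym_id _atom_site.label_seq_id
_atom_site.label_alt_id _atom_site.Cartn_x _atom_site.Cartn_y
_atom_site.Cartn_z _atom_site.occupancy _atom_site.B_iso_or_equiv
_atom_site.footnote_id _atom_site.entity_id _atom_site.entity_seq_num
_atom_site.id
ATOM N N  VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
"""
rows = parse_loop(EXTRACT)
print(rows[0]["_atom_site.Cartn_x"])
```

Unlike the fixed-column PDB line, nothing here depends on character positions: a value is found by its item name.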

15 mmCIF - Extract from the Dictionary

save__atom_site.Cartn_x
    _item_description.description
;   The x atom site coordinate in angstroms specified according to a set of
    orthogonal Cartesian axes related to the cell axes as specified by the
    description given in _atom_sites.Cartn_transform_axes.
;
    _item.name                    '_atom_site.Cartn_x'
    _item.category_id             atom_site
    _item.mandatory_code          no
    _item_aliases.alias_name      '_atom_site_Cartn_x'
    _item_aliases.dictionary      cifdic.c94
    _item_aliases.version         2.0
    loop_
    _item_dependent.dependent_name
        '_atom_site.Cartn_y'
        '_atom_site.Cartn_z'
    _item_related.related_name    '_atom_site.Cartn_x_esd'
    _item_related.function_code   associated_esd
    _item_sub_category.id         cartesian_coordinate
    _item_type.code               float
    _item_type_conditions.code    esd
    _item_units.code              angstroms
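A dictionary entry like this lets software check and convert values without hard-coded knowledge of each item. The sketch below hand-copies a few attributes of the entry into a Python dict and uses them to convert a raw value; the `DICTIONARY` layout, the `CONVERTERS` mapping, and the `convert` helper are hypothetical, and a real application would read the definitions from the dictionary file itself.

```python
# Hand-copied subset of the dictionary entry for _atom_site.Cartn_x.
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category": "atom_site",   # _item.category_id
        "mandatory": False,        # _item.mandatory_code no
        "type": "float",           # _item_type.code
        "units": "angstroms",      # _item_units.code
    },
}

# Hypothetical mapping from dictionary type codes to Python converters.
CONVERTERS = {"float": float}


def convert(item_name: str, raw_value: str):
    """Convert a raw mmCIF value according to its dictionary definition."""
    definition = DICTIONARY[item_name]  # KeyError if the item is undefined
    # In mmCIF, '.' marks a value as not applicable and '?' as unknown.
    if raw_value in (".", "?"):
        if definition["mandatory"]:
            raise ValueError(f"{item_name} is mandatory but has no value")
        return None
    return CONVERTERS[definition["type"]](raw_value)


print(convert("_atom_site.Cartn_x", "25.360"), DICTIONARY["_atom_site.Cartn_x"]["units"])
```

The description and the data are now linked by the item name, which is the separation problem the PDB format could not solve.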

16 Summary. mmCIF has provided the PDB with a robust data representation, which serves as the conceptual and physical schema upon which the current RCSB, PDBe and PDBj are built. This work predated XML and XML Schema but embodies the important concepts inherent in those descriptions. mmCIF was later converted exactly into XML; the XML form is now used more than mmCIF itself, but much less than the old PDB format. The PDB format will be phased out over a period of years.

17 Agenda. Before there were ontologies there was mmCIF. Briefly review the history of ontology development. Review the Gene Ontology (GO): motivation; features; related research activities around GO.

18 Formal Definitions Taken from Knowledge Engineering. 1. A systematic account of existence. 2. (From philosophy) An explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships that hold among them. 3. For AI systems, what "exists" is that which can be represented. When the knowledge about a domain is represented in a declarative language, the set of objects that can be represented is called the universe of discourse. We can describe the ontology of a program by defining a set of representational terms. Definitions associate the names of entities in the universe of discourse (e.g. classes, relations, functions or other objects) with human-readable text describing what the names mean, and formal axioms that constrain the interpretation and well-formed use of these terms. Formally, an ontology is the statement of a logical theory.

19 Formal Definitions Taken from Knowledge Engineering, Continued. 4. A set of agents that share the same ontology will be able to communicate about a domain of discourse without necessarily operating on a globally shared theory. We say that an agent commits to an ontology if its observable actions are consistent with the definitions in the ontology. The idea of ontological commitment is based on the Knowledge-Level perspective. 5. The hierarchical structuring of knowledge about things by subcategorizing them according to their essential (or at least relevant and/or cognitive) qualities. See subject index. This is an extension of the previous senses of "ontology" (above) which has become common in discussions about the difficulty of maintaining subject indices.

20 We will not focus too much on the formal definitions, but more on how these formal concepts have been applied to biology.

21 The History of Ontologies from a Biological Perspective. Early biological database efforts (1990s) adopted knowledge bases as a model, e.g. RiboWeb. They used products from the AI community, e.g. Ontolingua. Some of the concepts of knowledge bases remain, notably ontologies, but they are now mostly cast in more familiar commercial frameworks, e.g. relational databases.

22 The History of Ontologies from a Biological Perspective, Continued. The biological community in general was slow to see the value. The medical informatics community adopted ontologies early. In the late 1990s database providers in particular began to work together, the Gene Ontology (GO) being a major product of this effort. 1998-2004: ontologies were the rage, warranted their own session at bioinformatics meetings, and were taken seriously by the biological community. 2004 onward: accepted as part of biological data representation and use.

23 The History of Ontologies from a Biological Perspective, Continued. Centers established to support the maintenance of ontologies: the Open Biomedical Ontologies (OBO) Foundry; the National Center for Biomedical Ontology (BioPortal 2.0).

24 What Isn't an Ontology? A database or program: because they share internal formats only, it is not global. A table of contents: because it is not a formal representation of the concepts. A terminology (aka controlled vocabulary): because it is a set of terms without a formal structure of how they relate.

25 Examples of Valuable Terminologies (Controlled Vocabularies) That Are Not Ontologies: ICD-9 for diseases; SNOMED/RCD codes for symptoms; EC numbers (?); taxonomy; SMILES strings.

26 Ontology as Language. The ontology becomes the language of the domain it describes. The language = syntax + semantics. While that language must be understood by computers, human readability counts.

27 Ontology as Contract. Purposes of ontologies, with the parties to each contract: data exchange (programmers); unification/translation (data admins); calling knowledge services (programmers, netbots); representing theories (scientists); human communication (collaborators).

28 Ontology Specifications. XML: provides a syntax for structured documents. XML Schema: a language for structuring XML documents and adding data types. RDF: a data model for objects and the relations between them, which can be represented in XML. RDF Schema: describes properties and classes of RDF resources, with semantics for generalization. OWL 2 (Web Ontology Language): adds more vocabulary, particularly for relationships between classes (e.g. disjointness, cardinality).

29 Here is Another One: http://richard.cyganiak.de/2007/10/lod/lod-datasets_2010-09-22_colored.html

30 References: Ontologies in Bioinformatics. Bio-ontologies workshops since 1997. Historical papers on knowledge sharing. mmCIF as an ontology: Westbrook and Bourne (2000) Bioinformatics 16(2) 159-168. Review 2006: Bodenreider and Stevens, Briefings in Bioinformatics.

31 Agenda. Before there were ontologies there was mmCIF. Briefly review the history of ontology development. Review the Gene Ontology (GO): motivation; features; related research activities around GO.

32 References. GO itself: Creating the Gene Ontology Resource: Design and Implementation, Genome Research (2001) 11:1425-1433; Nucleic Acids Res. 2010 Jan;38(Database issue):D331-5, Epub 2009 Nov 17. The GO website: http://www.geneontology.org. Application of GO: The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro, Genome Res. 2003 Apr;13(4):662-72, Epub 2003 Mar 12.

33 Brief History. Started by the Saccharomyces Genome Database, FlyBase and the Mouse Genome Database. Grown to a consortium of members (member list linked from the original slide).

34 Roles of the GO Consortium. Write and maintain the ontologies themselves. Associate the ontologies with genes in the respective databases of members. Provide tools to facilitate the development and maintenance of ontologies.

35 Gene Ontology (GO), http://www.geneontology.org/. Three levels of annotation. Molecular function: what a gene product does at the biochemical level. Biological process: a broad biological perspective, not currently a pathway (no dynamics or dependencies). Cellular component: location within cellular structures (e.g. Golgi apparatus) and macromolecular complexes (e.g. ribosome).

36 GO Goals. From Genome Res 2001 Aug;11(8):1425-33

37 Structure of GO - Directed Acyclic Graph (DAG). Example from molecular function: the child term "transmembrane receptor protein tyrosine kinase" is_a "transmembrane receptor" and also is_a "protein tyrosine kinase", so a child can have more than one parent.

38 Structure of GO - Directed Acyclic Graph (DAG). Relationship of child to parent: is_a (the child is an instance of the parent) or part_of. A mitotic chromosome is_a (is an instance of a) chromosome. A telomere is part_of a chromosome.
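The DAG structure can be sketched with a plain dictionary of typed parent links, using only the terms named on the two slides above. This is an illustration, not GO's actual data model or file format; `PARENTS` and `ancestors` are hypothetical names.

```python
# Each child term maps to a list of (relationship, parent) links.
# A term may have several parents, which makes this a DAG rather than a tree.
PARENTS = {
    "transmembrane receptor protein tyrosine kinase": [
        ("is_a", "transmembrane receptor"),
        ("is_a", "protein tyrosine kinase"),
    ],
    "mitotic chromosome": [("is_a", "chromosome")],
    "telomere": [("part_of", "chromosome")],
}


def ancestors(term, relations=("is_a", "part_of")):
    """Collect every term reachable from `term` by following the given relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for relation, parent in PARENTS.get(current, []):
            if relation in relations and parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(ancestors("transmembrane receptor protein tyrosine kinase"))
print(ancestors("telomere", relations=("is_a",)))  # empty: a telomere is not a chromosome
```

Keeping the relationship type on each edge matters: following only is_a links answers "what kind of thing is this?", while part_of links answer "what is this a component of?".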

39 Example - Molecular Function

40 Example - Biological Process

41 Example - Cellular Location

42 Use of GO within the PDB: http://pdb.rcsb.org

43 Use of GO Within the Open Literature

44 Some Issues: Levels of Granularity and Species Specificity. Chitin metabolism is part of cuticle synthesis in fly. Chitin metabolism is part of cell wall organization in yeast.

45 Some Issues. GO is dynamic: parent-child relationships can change. When does a process begin and end? is_a and part_of are not always clear: is the actin cytoskeleton is_a cytoskeleton or part_of cytoskeleton? A community effort.

46 Relationship to Gene Products. A gene product is a protein or functional RNA. A gene product may have more than one function and therefore be related to multiple GO terms. The name of a gene product may only reflect one of its functions.

47 GO is Really 3 Independent Ontologies. Annotation of a gene product by one ontology is independent of its annotation by another ontology. Example: the products of the MDH1, MDH2 and MDH3 genes are all isoforms of malate dehydrogenase in yeast with the same function, but they localize to different cellular locations and are involved in different biochemical processes.
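The independence of the three ontologies can be illustrated by storing one annotation per ontology for each gene product. The labels below are informal placeholders chosen for illustration, not curated GO term names or annotations taken from the slide, and should be checked against the GO annotations for yeast before any real use; `ANNOTATIONS` and `all_same` are hypothetical names.

```python
# One label per ontology for each gene product (illustrative labels only).
ANNOTATIONS = {
    "MDH1": {
        "molecular_function": "malate dehydrogenase",
        "cellular_component": "mitochondrion",
        "biological_process": "tricarboxylic acid cycle",
    },
    "MDH2": {
        "molecular_function": "malate dehydrogenase",
        "cellular_component": "cytoplasm",
        "biological_process": "gluconeogenesis",
    },
    "MDH3": {
        "molecular_function": "malate dehydrogenase",
        "cellular_component": "peroxisome",
        "biological_process": "glyoxylate cycle",
    },
}


def all_same(ontology, products):
    """True if every listed gene product carries the same label in this ontology."""
    return len({ANNOTATIONS[p][ontology] for p in products}) == 1


products = ["MDH1", "MDH2", "MDH3"]
for ontology in ("molecular_function", "cellular_component", "biological_process"):
    print(ontology, all_same(ontology, products))
```

Agreement in one ontology says nothing about the other two, which is why each gene product needs to be annotated in all three.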

48 Evidence Codes. The evidence for assigning a gene product to a GO term itself has a controlled vocabulary.
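A controlled vocabulary for evidence can be enforced at the point where an annotation is created. The sketch below lists a handful of GO evidence codes (the full list is maintained by the GO Consortium and is longer); the `Annotation` class and the example gene product and term label are hypothetical.

```python
from dataclasses import dataclass

# A small subset of GO evidence codes; consult the GO documentation for the full list.
EVIDENCE_CODES = {
    "IDA": "Inferred from Direct Assay",
    "IMP": "Inferred from Mutant Phenotype",
    "ISS": "Inferred from Sequence or structural Similarity",
    "TAS": "Traceable Author Statement",
    "IEA": "Inferred from Electronic Annotation",
}


@dataclass(frozen=True)
class Annotation:
    gene_product: str
    term: str
    evidence: str

    def __post_init__(self):
        # Reject any evidence value that is not in the controlled vocabulary.
        if self.evidence not in EVIDENCE_CODES:
            raise ValueError(f"unknown evidence code: {self.evidence!r}")


annotation = Annotation("MDH1", "malate dehydrogenase activity", "IDA")
print(annotation.evidence, "=", EVIDENCE_CODES[annotation.evidence])
```

Recording the code with each annotation lets downstream users filter, for example, experimentally supported assignments from purely electronic ones.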

49 Research Applications of GO

51 Research Applications of GO

