Pharm201 Lecture 4 2009: Data Representation. Pharm 201/Bioinformatics I, Philip E. Bourne, Department of Pharmacology, UCSD. Prerequisite Reading: Structural.


1 Data Representation
Pharm 201/Bioinformatics I (Pharm201 Lecture 4, 2009)
Philip E. Bourne, Department of Pharmacology, UCSD
Prerequisite Reading: Structural Bioinformatics Chapters 10

2 Take Home Message
- Good data representation of complex data is not a trivial undertaking
- However it is prerequisite to effective use of those data
- History often precludes the above
- You should have got a sense of the first item from the assignment

3 Global Considerations in Defining a Data Representation
- Scope: breadth and depth of data to be included
- Name space
- How to cast that data
- What will the definition be used for? Archiving, schema representation, methods...

4 Global Considerations - More Specifically
- Simple query, browsing and retrieval
- Consistent data resulting from autonomous validation and verification
- Simple and consistent data exchange
- A unified view of disparate types of data
- Accommodation of new knowledge as it is discovered
- Inclusion of procedures (methods) to specify how a particular item of data is derived or verified

5 Given These Considerations - Where Does the PDB Format Fit In?
First we need to examine the format

6 The PDB Format
- A full description is here
- It was designed around an 80 column punched card!
- It was designed to be human readable
- It is used by every piece of software that deals with structural data

7 The PDB Format - Records
- Every PDB file may be broken into a number of lines terminated by an end-of-line indicator.
- Each line in the PDB entry file consists of 80 columns.
- The last character in each PDB entry should be an end-of-line indicator.
- Each line in the PDB file is self-identifying. The first six columns of every line contain a record name, left-justified and blank-filled. This must be an exact match to one of the stated record names.
- The PDB file may also be viewed as a collection of record types. Each record type consists of one or more lines. Each record type is further divided into fields.
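The record rules above can be sketched in a few lines of Python. This is a minimal illustration, not the parser referred to later in the lecture: the function names `record_name` and `group_by_record` and the three-line sample entry are assumptions made up for this example, and nothing beyond the first six columns is interpreted.

```python
from typing import Dict, List


def record_name(line: str) -> str:
    """Return the record name stored in columns 1-6 of a PDB line."""
    # The name is left-justified and blank-filled, so drop the padding.
    return line[:6].rstrip()


def group_by_record(entry: str) -> Dict[str, List[str]]:
    """Group the lines of a PDB entry by their record name."""
    groups: Dict[str, List[str]] = {}
    for line in entry.splitlines():
        if line.strip():  # skip empty lines
            groups.setdefault(record_name(line), []).append(line)
    return groups


# Hypothetical three-line entry, each line padded to 80 columns.
sample_entry = "\n".join(
    text.ljust(80)
    for text in (
        "HEADER    TOXIN",
        "REMARK   4 FIRST LINE OF FREE TEXT",
        "REMARK   4 SECOND LINE OF FREE TEXT",
    )
)
records = group_by_record(sample_entry)
print({name: len(lines) for name, lines in records.items()})
```

Identifying the record type is the easy part. Each record type then needs its own column-by-column field layout, which is where a real parser becomes large.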

8 The PDB Format - An Example - The Header

9 The PDB Format - An Example - The Atomic Coordinates

10 The Description - Atom Records

11 What is Wrong with this Approach?
- The description and the data are separate
- Parsing is a nightmare: the most complex piece of code we have in our research laboratory probably remains the PDB parser
- There are no relationships between items of data
- Some data just cannot be parsed ...

12 PDB Format - Important Components of the Data are Lost to All But Humans
REMARK 3 REFINEMENT. BY THE RESTRAINED LEAST-SQUARES PROCEDURE OF
REMARK 3 J. KONNERT AND W. HENDRICKSON (PROGRAM *PROLSQ*). THE R
REMARK 3 VALUE IS 0.168 FOR 2680 REFLECTIONS WITH I GREATER THAN
REMARK 3 2.0*SIGMA(I) REPRESENTING 74 PER CENT OF THE TOTAL
REMARK 3 AVAILABLE DATA IN THE RESOLUTION RANGE 10.0 TO 2.0
REMARK 3 ANGSTROMS.
REMARK 4 THE ERABUTOXIN A (EA) CRYSTAL STRUCTURE IS ISOMORPHOUS WITH
REMARK 4 THE KNOWN STRUCTURE OF ERABUTOXIN B (PROTEIN DATA BANK
REMARK 4 ENTRIES *2EBX*, *3EBX*). EA DIFFERS FROM EB BY A SINGLE
REMARK 4 SUBSTITUTION - EA ASN 26 FOR EB HIS 26. THE EA STARTING
REMARK 4 MODEL WAS OBTAINED FROM A MOLECULAR REPLACEMENT STUDY IN
REMARK 4 WHICH COORDINATES FOR 309 OF THE 475 ATOMS IN THE EB
REMARK 4 STRUCTURE (*2EBX*) WERE USED.

13 Enter mmCIF
Prerequisite reading: http://www.sdsc.edu/pb/papers/methenz97.pdf
Complete information: http://mmcif.pdb.org

14 The macromolecular Crystallographic Information File (mmCIF) - An Approach to Addressing Problems with the PDB Format
- Has the support of a major scientific society
- Is the backbone of the current PDB
- Provides a rich description of very complex data
- Predates any use of ontologies, Web developments, CORBA, XML etc.
- Still has some problems

15 mmCIF - Initial Motivator Circa Late 1980s
"The temperature is 30 degrees"
- With additional context, a human would know whether that was Centigrade or Fahrenheit. A computer would have more difficulty!
- What would be the point of archiving such data if in 10 years the meaning was lost?

16 mmCIF - Scope of the Initial Effort
- All PDB data should be captured
- Describe a paper's materials and methods section
- Describe the biologically active molecule
- Fully describe secondary structure but not tertiary or quaternary
- Describe details of chemistry (inc. 2D)
- Meaningful 3D views

17 mmCIF - Topology

18 mmCIF - STAR Encoding Rules
- Data are defined in data blocks
- A global declaration spans data blocks
- Data exist as name-value pairs
- A data name may appear only once in a data block
- Loop constructs are supported
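Two of these rules, name-value pairs inside data blocks and the once-per-block restriction on data names, can be sketched as follows. This is a simplified illustration under stated assumptions: it handles only single-line, unquoted values, ignores global declarations and loops, and the function name `parse_data_blocks`, the block names and the values are all made up for the example.

```python
from typing import Dict


def parse_data_blocks(text: str) -> Dict[str, Dict[str, str]]:
    """Read 'data_<name>' blocks made of simple name-value pairs."""
    blocks: Dict[str, Dict[str, str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # blank line or comment
            continue
        if line.startswith("data_"):
            # A new data block starts; later items belong to it.
            current = blocks.setdefault(line[len("data_"):], {})
            continue
        if current is None:
            raise ValueError("data item found outside a data block")
        name, value = line.split(None, 1)
        if name in current:
            # A data name may appear only once in a data block.
            raise ValueError(f"data name {name} repeated in one data block")
        current[name] = value
    return blocks


# Hypothetical file content with two data blocks.
example = """
data_demo_a
_entry.id        DEMO_A
_cell.length_a   10.0
data_demo_b
_entry.id        DEMO_B
"""
blocks = parse_data_blocks(example)
print(blocks["demo_a"]["_cell.length_a"])
```

The same data name, here `_entry.id`, is allowed in both blocks because the uniqueness rule applies within a block, not across the file.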

19 mmCIF - Extract from a Data File
loop_
_atom_site.group_PDB
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_seq_id
_atom_site.label_alt_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.footnote_id
_atom_site.entity_id
_atom_site.entity_seq_num
_atom_site.id
ATOM N N  VAL A 11 . 25.360 30.691 11.795 1.00 17.93 . 1 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 1 11 2
ATOM C C  VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 1 11 3
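A loop like the one in this extract pairs a list of data names with rows of values, so each value is self-describing. The sketch below reads such a loop into one dictionary per row. It assumes whitespace-separated, unquoted values, uses only a reduced set of the columns shown on the slide, and the function name `parse_loop` is hypothetical.

```python
from typing import Dict, List


def parse_loop(text: str) -> List[Dict[str, str]]:
    """Turn a simple 'loop_' construct into one dict per row."""
    tokens = text.split()
    if not tokens or tokens[0] != "loop_":
        raise ValueError("text does not start with loop_")
    # Data names come first and all begin with an underscore.
    names = []
    index = 1
    while index < len(tokens) and tokens[index].startswith("_"):
        names.append(tokens[index])
        index += 1
    values = tokens[index:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("value count is not a multiple of the name count")
    width = len(names)
    return [
        dict(zip(names, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


# Reduced set of columns taken from the slide's atom_site extract.
loop_text = """
loop_
_atom_site.group_PDB
_atom_site.label_atom_id
_atom_site.label_comp_id
_atom_site.label_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
ATOM N  VAL 11 25.360 30.691 11.795
ATOM CA VAL 11 25.970 31.965 12.332
"""
rows = parse_loop(loop_text)
print(rows[1]["_atom_site.label_atom_id"], rows[1]["_atom_site.Cartn_x"])
```

Unlike the fixed-column PDB records, the position of a value on the line carries no meaning here. Only its order relative to the declared data names matters.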

20 mmCIF - Extract from the Dictionary
save__atom_site.Cartn_x
    _item_description.description
;   The x atom site coordinate in angstroms specified according to a set of orthogonal Cartesian axes related to the cell axes as specified by the description given in _atom_sites.Cartn_transform_axes.
;
    _item.name '_atom_site.Cartn_x'
    _item.category_id atom_site
    _item.mandatory_code no
    _item_aliases.alias_name '_atom_site_Cartn_x'
    _item_aliases.dictionary cifdic.c94
    _item_aliases.version 2.0
    loop_
    _item_dependent.dependent_name
        '_atom_site.Cartn_y'
        '_atom_site.Cartn_z'
    _item_related.related_name '_atom_site.Cartn_x_esd'
    _item_related.function_code associated_esd
    _item_sub_category.id cartesian_coordinate
    _item_type.code float
    _item_type_conditions.code esd
    _item_units.code angstroms
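Because the dictionary records a type and units for each data name, software can interpret a raw value without human context. The sketch below shows the idea with a hand-copied fragment of the definition above. The `DICTIONARY` structure, the `CONVERTERS` table and the function `typed_value` are assumptions made for illustration, not part of any mmCIF library.

```python
from typing import Optional, Tuple

# Hand-copied subset of the dictionary entry shown on the slide.
DICTIONARY = {
    "_atom_site.Cartn_x": {
        "category_id": "atom_site",
        "mandatory_code": "no",
        "type_code": "float",
        "units_code": "angstroms",
    },
}

# Hypothetical mapping from dictionary type codes to Python converters.
CONVERTERS = {"float": float, "int": int, "text": str}


def typed_value(name: str, raw: str) -> Tuple[object, Optional[str]]:
    """Convert a raw string using the type and units the dictionary declares."""
    definition = DICTIONARY[name]
    convert = CONVERTERS[definition["type_code"]]
    return convert(raw), definition.get("units_code")


value, units = typed_value("_atom_site.Cartn_x", "25.360")
print(value, units)
```

With the units stored beside the definition, a value carries its own meaning, which speaks to the "30 degrees" question that motivated mmCIF.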

21 mmCIF Dictionary Definition Language
The DDL category item_description holds a description for each data item. The key item for this category is item_description.name which is defined in the parent category item. The text of the item description is held by item _item_description.description. A single description may be provided for each data item. The DDL for the item_description category is given in the following section.
save_ITEM_DESCRIPTION
    _category.description
;   This category holds the descriptions of each data item.
;
    _category.id item_description
    _category.mandatory_code yes
    loop_
    _category_key.name
        '_item_description.name'
        '_item_description.description'
    loop_
    _category_group.id
        'ddl_group'
        'item_group'
save_
save__item_description.description
    _item_description.description
;   Text description of the defined data item.
;
    _item.name '_item_description.description'
    _item.category_id item_description
    _item.mandatory_code yes
    _item_type.code text
save_

22 mmCIF - Topology Revisited

23 mmCIF - The Category Group Organization of any Macromolecular Structure
- STRUCT_BIOL
- STRUCT_BIOL_GEN
- STRUCT_ASYM
- ENTITY
- ENTITY_POLY
- ENTITY_POLY_SEQ
- CHEM_COMP
- ATOM_SITE
- STRUCT_CONF
- STRUCT_CONN
- STRUCT_SITE_GEN
- STRUCT_REF

24 mmCIF - Entity - Unique Chemical Component

25 mmCIF - Defining Secondary Structure

26 mmCIF - Other Interactions

27 mmCIF - Defining the Biological Molecule

28 mmCIF - Defining non-standard Amino Acids

29 mmCIF - Problems
- No header for recognition as an mmCIF file
- No reference to describe what dictionary created the data file
- The environment at each level of the hierarchy is different, e.g. between categories and category groups
- Another implication of this is the complex way of defining the absolute name of a piece of data: filename-datablock-data_name-loop_iteration
- Granularity of the data is too coarse
- Poor data typing
- Little software!!!!
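The four-part absolute name mentioned above can be made concrete with a small sketch. The class name `DataAddress`, the file name and the block name are hypothetical. The hyphen separator simply mirrors the slide's notation and is not a defined mmCIF syntax.

```python
from typing import NamedTuple


class DataAddress(NamedTuple):
    """Absolute name of one value: file, data block, data name, loop row."""
    filename: str
    datablock: str
    data_name: str
    loop_iteration: int

    def __str__(self) -> str:
        # Join the four parts in the order given on the slide.
        return "-".join(str(part) for part in self)


# Hypothetical address of the x coordinate in the second row of a loop.
address = DataAddress("example.cif", "demo", "_atom_site.Cartn_x", 2)
print(address)
```

Every one of the four parts is needed to identify a single value unambiguously, which illustrates why the slide calls this naming complex.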

30 Summary
- mmCIF has provided the PDB with a robust data representation which serves as the conceptual and physical schema upon which the current RCSB, PDBe and PDBj are built
- This work predated XML and XML-schema but embodies the important concepts inherent in these descriptions
- mmCIF was later converted exactly into XML; the XML form is now used more than mmCIF itself, but much less than the old PDB format

31 Take Home Message
- Good data representation of complex data is not a trivial undertaking
- However it is prerequisite to effective use of those data

