Gene Annotation and Analysis Lab Work Reference: European Multimedia Bioinformatics Educational Resource.


1 Gene Annotation and Analysis Lab Work Reference: European Multimedia Bioinformatics Educational Resource

2 Chapter 6: Fold Classification The aim of this part of the tutorial is to learn about the structure of your protein. Where 3D coordinates are available, we start by examining the Protein Data Bank (PDB) summary files. We then examine some of the structure classification resources, and compare results to see if they are similar.

3 Step 1: Homologues with known 3D structure
In this step, we will seek homologues of known structure.
i) Use your sequence to run BLAST at the EBI or NCBI, setting PDB as the database. Choose the most significant hit and note its PDB ID code. If a structure exists, it should be the first hit.
ii) Alternatively, from your sequence's SWISS-PROT or TrEMBL entry, you could use the EMBnet Direct BLAST option, again selecting PDB as the database.
Reflections...
– How many matches were statistically significant?
– Why are there fewer hits from a BLAST of PDB than from a BLAST of a sequence database?
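As a rough sketch of how "the most significant hit" might be picked out programmatically, the example below assumes the BLAST results were saved in the standard 12-column tabular format (the layout NCBI BLAST+ produces with `-outfmt 6`). The hit lines, the subject identifiers and the E-value cutoff of 1e-3 are all invented for illustration and are not taken from the tutorial.

```python
# Hypothetical BLAST hits in 12-column tabular format. Columns:
# query, subject, %identity, length, mismatches, gap opens,
# q.start, q.end, s.start, s.end, E-value, bit score.
SAMPLE = """\
query1\t1ABC_A\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-110\t390
query1\t2XYZ_B\t41.2\t180\t100\t4\t10\t189\t5\t180\t3e-25\t102
query1\t3DEF_A\t27.0\t90\t60\t3\t50\t139\t20\t105\t0.8\t30.1
"""


def parse_hits(text):
    """Turn tabular BLAST output into a list of dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append({
            "subject": fields[1],
            "identity": float(fields[2]),
            "evalue": float(fields[10]),
            "bitscore": float(fields[11]),
        })
    return hits


def significant(hits, cutoff=1e-3):
    """Keep hits at or below the (assumed) E-value cutoff, best first."""
    kept = [h for h in hits if h["evalue"] <= cutoff]
    return sorted(kept, key=lambda h: h["evalue"])


hits = significant(parse_hits(SAMPLE))
best = hits[0]
print(f"{len(hits)} significant hits; best is {best['subject']}")
```

The appropriate cutoff depends on database size and on how cautious you want to be; the value used here is only a placeholder.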

4 Step 2: PDB summary files
Here, we will explore details of your structure from its PDB summary files.
i) Supply the PDB ID code to the PDBsum query form. Examine the entry, including the images provided, the secondary structure elements, any associated ligand(s), the PDB header, and so on.
ii) Alternatively, supply a keyword to the Search string box (e.g., "rhodopsin").
Reflections...
– How do the secondary structure elements shown here compare with the predictions from the previous Chapter? How accurate are the predictions?
– What information is stored in the PDB header? What is the purpose of having a sequence stored in the header when sequence databases store this information?
– Does your protein have an associated ligand? If so, which residues in the sequence interact with the ligand? Referring back to your sequence alignment, are these residues conserved? Would you expect them to be? Do any of them lie in the motifs defined by PROSITE, BLOCKS or PRINTS?

5 Step 3: Protein Classification
PDBsum unites a number of resources. We will now make use of this feature to explore some structure classification databases and, where appropriate, an enzyme classification resource.
i) From PDBsum, follow the CATH link. Follow the link(s) under the Quick Links heading to discover the position of your protein within the CATH hierarchy and view its structural relatives. To navigate the hierarchy, click on the links next to the hierarchy icons.
ii) From PDBsum, follow the SCOP link to discover the position of your protein within the SCOP hierarchy (under the heading Quick Links) and view its structural relatives. Follow links to the NCBI's PDB entry and explore the external links. NOTE: For some proteins, there may be more than one polypeptide chain; the SCOP hierarchies can be explored by following these links independently.
iii) Where applicable, PDBsum also links to the Enzyme Commission (EC) classification. Follow the EC->PDB link to view this, and the corresponding links to the ExPASy ENZYME, KEGG and WIT databases.

6 Step 3: Protein Classification
Reflections...
– CATH: What is the CATH number of your protein? What does this number mean in terms of its Class, Architecture, Topology and Homology? How does the structural hierarchy differ from the familial hierarchies used in PRINTS? If your structure belongs to more than one class, why might this be so?
– SCOP: How is your protein classified in SCOP in terms of its Class, Fold, Superfamily and Family? How does the structural hierarchy differ from the familial hierarchies used in PRINTS? Is the Pfam link to the same entry found in Chapter 3? If not, how is it related? Does the classification differ from that given by CATH? If so, why might this be so? If your structure belongs to more than one class, why might this be so?
– EC->PDB: If there is no EC->PDB link in your entry, why might this be so? If there is a link, what is the EC number of the entry, and what catalytic role does this number reflect?

7 Step 4: Protein structure visualisation
Here, we will visualise the structure of your protein and become familiar with molecular viewers and the PDB file format.
i) Follow the link from PDBsum to the PDB entry. Click on QuickPDB (this requires a Java-enabled browser). Highlight residues that you know to be important, such as motifs identified in the protein family database searches. NOTE: Theoretical models are no longer available from the main PDB directory.
ii) To download the raw PDB file, return to the PDB entry and follow the link to Download Files on the right-hand side menu. In the Download Files menu, choose the uncompressed PDB link. View the file in a text editor to get a feel for how the annotation fields and atomic coordinates are encoded.
iii) There are several PDB structure viewers and molecular visualisation packages available for download. Some examples are listed below.
Downloadable Molecular Structure Viewers
PDB structure viewers: RasMol, QuickPDB and Deep View
Non-PDB format viewer (accepts files in an NCBI-specific format): Cn-3D

8 Step 4: Protein structure visualisation
Reflections...
– How do the conserved motifs relate to the structure?
– What functional inference, if any, can you deduce from the relative positions of the conserved motifs in 3D? E.g., do they congregate around an active site?
– In the PDB file, what is the name of the field used to store the 3D coordinates?

9 Quiz: Chapter 6
1. Databases such as CATH and SCOP are used to identify:
A. The structural family to which a protein belongs.
B. The genic family to which a protein belongs.
C. Homologous proteins.
D. Analogous proteins.
2. Resources such as EC->PDB are used to identify:
A. The structural class of proteins.
B. The catalytic activity of enzymes with known structure.
C. The family to which a protein belongs.
D. Details of the reaction mechanism of a protein.
3. In CATH, proteins are grouped together at the topology level on the basis that they share:
A. The same gross secondary structure composition.
B. The same secondary structures but different connectivities.
C. The same overall shape and connectivity of secondary structures.
D. A common ancestor.

10 Quiz: Chapter 6
4. For SCOP, which of the following statements is TRUE:
A. Entries are created using automated methods only.
B. Entries are created using automated and manual methods.
C. Entries are created using manual methods only.
D. Entries are derived from CATH.
5. Coordinates for known protein structures are housed in:
A. CATH.
B. SCOP.
C. PDBsum.
D. PDB.

11 Information
6.1 PDB
6.2 PDB Summary
6.3 CATH
6.4 SCOP
6.5 EC->PDB
6.6 Visualisation of Protein Molecules

12 6.1 PDB
The Protein Data Bank (PDB) is the principal repository of biological macromolecule structures. These are derived from a number of different experimental techniques (described under the Materials and Methods section), including electron, X-ray and neutron diffraction, and NMR. The PDB is maintained by a non-profit consortium, termed the Research Collaboratory for Structural Bioinformatics (RCSB). Several mirrors are available worldwide from which PDB entries may be viewed and downloaded.

13 6.1 PDB
Stored in a text format, PDB files contain a 'header' and a main body, which stores the atomic coordinates of all the resolved atoms in the structure. The header includes the following details:
– information on the protein, organism, etc.
– literature citations
– protein sequence (which may be different from those found in sequence databases, e.g., if the protein has been engineered to facilitate crystallisation)
– the method by which the structure was obtained
– crystal packing and refinement information
– secondary structure information (e.g., helix from residues 13-25, turn from residues 26-30, etc.)
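To make the header/body split concrete, here is a minimal sketch of reading the legacy fixed-width PDB text format. It assumes the standard column layout, in which the first six characters of each line name the record type and coordinates are stored in ATOM (and HETATM) records. The sample lines, coordinates and the helper `make_atom_line` are invented for illustration; a real entry contains many more record types.

```python
from collections import Counter


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-width ATOM record (illustrative helper, made-up data)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{20.0:6.2f}"
    )


# A tiny, hypothetical file: two header-style records, two atoms, END.
SAMPLE_LINES = [
    "HEADER    HYPOTHETICAL EXAMPLE ENTRY",
    "REMARK   2 RESOLUTION.    2.00 ANGSTROMS.",
    make_atom_line(1, " N  ", "MET", "A", 1, 11.104, 6.134, -6.504),
    make_atom_line(2, " CA ", "MET", "A", 1, 12.345, 6.789, -5.001),
    "END",
]


def record_type(line):
    """The record name occupies columns 1-6 of every line."""
    return line[:6].strip()


def parse_atom(line):
    """Slice an ATOM/HETATM record by its fixed column positions."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


counts = Counter(record_type(line) for line in SAMPLE_LINES)
atoms = [parse_atom(line) for line in SAMPLE_LINES
         if record_type(line) in ("ATOM", "HETATM")]

print(dict(counts))
for atom in atoms:
    print(atom["name"], atom["res_name"], atom["res_seq"],
          atom["x"], atom["y"], atom["z"])
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can run together without a separating space.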

14 6.1 PDB

15 6.1.1 Growth of the PDB
Unlike DNA sequencing, protein structure determination is not yet a fully automated process. Different techniques have different limitations. In crystallography, for example, obtaining crystals can be difficult; and, once crystals have been obtained, finding candidates that diffract well can be problematic. Whatever the technique used, building a robust structural model from the raw data is a further time-limiting factor. The rate of submission of new structures to the PDB is thus far lower than the deposition rate of sequence data to the central sequence repositories: e.g., in July 2002, PDB contained 16,507 entries, while in June 2002 GenBank contained 17,471,000 sequences (see Fig 6.2). Note that both of these figures are highly redundant, so the number of unique structures and sequences is very much smaller.

16 6.1.1 Growth of the PDB Fig 6.2. Difference in the growth of the number of sequences in GenBank vs. the number of 3D structures in PDB. The graph has been truncated at 1994 to keep the curves on the same scale.

17 6.2 PDB Summary
PDBsum provides summary information for all proteins of known structure. The structure summary includes details of resolution and R factor, secondary structure, associated ligands, fold cartoons, ligand interactions, and so on. A brief overview of this information is also available for each entry.

18 6.3 CATH
CATH is a hierarchical classification of protein structural relationships derived using a combination of automatic and manual methods. CATH identifies different classes by means of a unique number (by analogy with the E.C. system for enzymes), as well as a descriptive name. The acronym denotes:
Class - the highest level of the classification, derived from overall secondary structure content and packing
Architecture - describes the gross arrangement or orientation of secondary structures, independent of their connectivities
Topology - relates both to the overall shape and connectivity of the secondary structures
Homologous superfamily - clusters protein domains that share both sequence and structural similarity (and are hence believed to be homologous)
In addition, a Sequence (S-level) subset clusters H-level structures on the basis of sequence identity. Domains in the same S-level have sequence identities >35% (with at least 60% of the larger domain equivalent to the smaller), indicating highly similar structures. Fig 6.3 depicts a few examples of architectures recognised in CATH.
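The way a CATH number encodes the hierarchy can be sketched as follows. This is an illustration only: the function names are hypothetical, the class names are those commonly listed for CATH classes 1 to 4 (treat them as an assumption and check them against the CATH site), and the code `3.40.50.720` is used purely as an example string. The S-level check simply restates the two thresholds quoted above.

```python
# Class names as commonly listed by CATH (assumed here, not from the tutorial).
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


def parse_cath(code):
    """Split a CATH number into its C, A, T and H levels."""
    parts = [int(p) for p in code.split(".")]
    if len(parts) < 4:
        raise ValueError("expected at least four dot-separated numbers")
    c, a, t, h = parts[:4]
    return {
        "class": c,
        "class_name": CATH_CLASSES.get(c, "unknown"),
        "architecture": f"{c}.{a}",
        "topology": f"{c}.{a}.{t}",
        "homologous_superfamily": f"{c}.{a}.{t}.{h}",
    }


def same_s_level(identity_pct, overlap_pct):
    """Apply the quoted S-level rule: identity >35% and >=60% overlap."""
    return identity_pct > 35 and overlap_pct >= 60


info = parse_cath("3.40.50.720")
print(info["class_name"], info["architecture"], info["topology"])
print(same_s_level(40, 80), same_s_level(35, 80))
```

Each level is identified by the full prefix up to that point, which is why two domains with the same third number but different first or second numbers do not share a topology.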

19 6.3 CATH

20 6.4 SCOP
The SCOP (Structural Classification Of Proteins) database describes structural and evolutionary relationships between proteins of known structure. The database has been constructed using a combination of manual inspection and automated methods, because current automatic sequence and structure comparison tools cannot identify all structural relationships reliably. Proteins are classified in a hierarchical fashion to reflect their structural and evolutionary relatedness.
Within the hierarchy, the principal levels describe the family, superfamily and fold:
– proteins are clustered into families with clear evolutionary relationships if they have sequence identities >= 30% (but this is not an absolute measure);
– proteins are placed in superfamilies when, in spite of low sequence identity, their structural and functional characteristics suggest a common evolutionary origin; and
– proteins are classed as having a common fold if they have the same major secondary structures in the same arrangement and with the same topology, whether or not they have a common evolutionary origin. In these cases, the structural similarities could have arisen as a result of physical principles that favour particular packing arrangements and fold topologies.
Boundaries between such levels may be subjective, but the higher levels generally reflect the clearest structural similarities.
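The 30% identity guideline for families can be illustrated with a small calculation on a pairwise alignment. This is a sketch under stated assumptions: the two aligned sequences are invented, percent identity is computed over columns where neither sequence has a gap (one of several conventions in use), and the function names are hypothetical. As the text notes, the threshold is not an absolute measure, so the result is only a hint.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over alignment columns with no gap in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(x == y for x, y in pairs)
    return 100.0 * matches / len(pairs)


def scop_level_hint(identity_pct, threshold=30.0):
    """Rough reading of the guideline; never a substitute for inspection."""
    if identity_pct >= threshold:
        return "family candidate"
    return "needs structural/functional evidence (superfamily or fold)"


# Invented aligned fragments, with '-' marking gaps.
seq_a = "MKT-AYIAKQR"
seq_b = "MKTLAYLAK-R"

pid = percent_identity(seq_a, seq_b)
print(f"{pid:.1f}% identity: {scop_level_hint(pid)}")
```

Short fragments like these can show high identity by chance, which is one reason the guideline has to be combined with alignment length and structural evidence.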

21 6.5 EC->PDB
The Enzyme Structures Database (EC->PDB) relates known enzyme structures deposited in the PDB to their Enzyme Commission (EC) classification and provides links to the ExPASy ENZYME Data Bank. The EC classification comprises the following broad categories:
E.C.1.-.-.- Oxidoreductases
E.C.2.-.-.- Transferases
E.C.3.-.-.- Hydrolases
E.C.4.-.-.- Lyases
E.C.5.-.-.- Isomerases
E.C.6.-.-.- Ligases
Entries also include links to the Kyoto Encyclopedia of Genes and Genomes, KEGG (an effort to computerise current knowledge of molecular and cellular biology in terms of interacting genes and molecular pathways), and the Evolutionary Analysis of Metabolism, PUMA2.
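Reading the broad category from an EC number is a simple lookup on its first field, as the sketch below shows. The class table is copied from the list above; the function name, the accepted input spellings and the example number `EC 3.4.21.4` are illustrative assumptions.

```python
import re

# Top-level EC classes as listed in this section.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

# Accepts forms such as "E.C.3.-.-.-", "EC 3.4.21.4" or "3.4.21.4".
EC_PATTERN = re.compile(
    r"\s*(?:E\.?C\.?\s*)?(\d+)\.(\d+|-)\.(\d+|-)\.(\d+|-)\s*",
    flags=re.IGNORECASE,
)


def parse_ec(ec):
    """Return (class number, class name, [sub-levels]) for an EC number."""
    match = EC_PATTERN.fullmatch(ec)
    if match is None:
        raise ValueError(f"not an EC number: {ec!r}")
    top = int(match.group(1))
    # A dash means that level is unspecified.
    levels = [None if g == "-" else int(g) for g in match.groups()[1:]]
    return top, EC_CLASSES.get(top, "unknown class"), levels


print(parse_ec("E.C.3.-.-.-"))
print(parse_ec("EC 3.4.21.4"))
```

The remaining three fields narrow the reaction down progressively, so two enzymes sharing only the first number catalyse the same broad type of reaction but not necessarily on similar substrates.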

22 6.6 Visualisation of Protein Molecules
Many protein structure viewers use atomic coordinates and secondary structure information directly from the PDB (these include RasMol, QuickPDB and Deep View). Others use their own format (e.g., Cn-3D). Each has various display options, to highlight different residues or residue types, display the atoms in different styles, etc., as shown in Fig 6.4. Programs such as Cn-3D and QuickPDB allow users to highlight areas of interest, which can be useful for mapping known motifs to 3D space (e.g., conserved regions in some families surround the active site, even if they are not close together in sequence).
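The point that motif residues can be far apart in sequence yet close in space can also be checked numerically. The sketch below assumes you already have one C-alpha coordinate per residue of interest; the residue numbers, the coordinates, the 8 Å distance cutoff and the minimum sequence separation of 20 are all invented for illustration.

```python
import math
from itertools import combinations

# Hypothetical C-alpha coordinates (in angstroms), keyed by residue number.
CA_COORDS = {
    10: (0.0, 0.0, 0.0),
    11: (3.8, 0.0, 0.0),
    45: (1.0, 5.0, 2.0),
    90: (20.0, 15.0, 9.0),
}


def close_but_distant(ca_coords, cutoff=8.0, min_sep=20):
    """Find residue pairs near in space but far apart in sequence."""
    pairs = []
    for (i, p), (j, q) in combinations(sorted(ca_coords.items()), 2):
        distance = math.dist(p, q)  # Euclidean distance (Python 3.8+)
        if j - i >= min_sep and distance <= cutoff:
            pairs.append((i, j, round(distance, 2)))
    return pairs


contacts = close_but_distant(CA_COORDS)
for i, j, d in contacts:
    print(f"residues {i} and {j}: {d} angstroms apart")
```

Pairs reported this way are candidates to highlight in a viewer; whether they actually line an active site still has to be judged from the structure itself.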

23 6.6 Visualisation of Protein Molecules
Fig 6.4. Different protein structure viewers displaying the ubiquitin-like signalling protein, Nedd8 (PDB ID: 1NND). (A) Deep View, (B) RasMol, (C) QuickPDB and (D) Cn-3D. (A) illustrates classical ball and stick mode, (B) cartoon mode, (C) a wireframe α-carbon trace, with a small section of the structure highlighted in blue, and (D) a hybrid display with amino acid chains in cartoon mode and non-amino acid atoms in space-filling mode.

24 6.6 Visualisation of Protein Molecules
Some programs are more than just viewers. For example, Deep View has functions for superposition of different structures, virtual amino acid mutations, interfacing with Swiss-Model, and so on (see Chapter 7). There are several advanced features for structural biologists, including importing electron density maps to build structures, and various integrated modelling tools for energy minimisation.

