CATH and SCOP. Topic 8: Chapters 17 & 18, Gu and Bourne, “Structural Bioinformatics”

SCOP vs. CATH
- The basic classification unit is the domain.
  -- A distinct structural unit that may fold as an independent, compact unit.
  -- Considered the basic unit of protein folding, function and evolution.
  -- Domain partition, whether manual or automatic, is not trivial.
- CATH is semi-automatic; SCOP is mainly manual (human expertise).
[Figure: SCOP levels (class, fold, superfamily, family) on a similarity scale running from low at the class level to high at the family level, where the relationship is clear.]

Protein domains: Pyruvate kinase; Elongation factor EF-Tu

SCOP vs. CATH “CATH – A hierarchic classification of protein structure domains” Orengo et al., Structure, 5, 1997.

Protein Structure Classification
Why bother?
- Provides structural and evolutionary relationships
- Provides a view of the current fold space
- Assists protein structure prediction (details later)
Two popular protein classification databases:
- SCOP (Structural Classification Of Proteins). Latest release: v1.75 (June 2009), 110,800 domains. Murzin et al., J. Mol. Biol. 247, 1995.
- CATH: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). Recent release: v3.5 (Sept 2011), 173,536 domains. Orengo et al., Structure, 5, 1997.

Hierarchical Structure Classification “CATH – A hierarchic classification of protein structure domains” Orengo et al., Structure, 5, 1997.

Hierarchical Structure Classification
- SCOP: Class (α/β), Fold (TIM beta/alpha-barrel), Superfamily (Triosephosphate isomerase), Family (Triosephosphate isomerase)
- CATH: Class (α-β), Architecture (α-β barrel), Topology (TIM barrel), Homologous Superfamily (Aldolase class 1), Sequence Family (Isomerase)

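Hierarchical Structure Classification
[Figure: SCOP levels (Class, Fold, Superfamily, Family) shown alongside CATH levels (C, A, T, H, S). Annotations: conservation of secondary (2°) structure content; high-level structure similarity; lower-level structure similarity; typically orthologs.]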

This seems trivial. Why are we wasting valuable class time discussing it? While a hierarchical description of protein structure is conceptually straightforward, as you will see, automating it is not. Moreover, the domain boundary problem is actually quite difficult. Also, this discussion is nice in the sense that it ties together a lot of different bioinformatics concepts into one unified effort. Some of these concepts are structural; however, many are not.

Topology vs. Architecture
Flavodoxin (topology = Rossmann fold) and domain 1 of β-lactamase: same architecture, but different topology.
Caution: Because of how the secondary structures are interconnected, different topologies can converge on the same overall architecture.

An even trickier example “CATH – A hierarchic classification of protein structure domains” Orengo et al., Structure, 5, 1997.

The CATH Classification Strategy
(1.) Close relatives are identified via sequence comparisons.
(2.) Sequence profiles and structure comparison protocols are used to detect more distant homologies.
(3.) Structures unclassified at this stage are then examined using both automatic and manual procedures to determine domain boundaries.
(4.) Unclassified domain structures are re-examined using the methods employed in steps 2 and 3.
(5.) Finally, any structures remaining unclassified are manually assigned to existing or new architectures within CATH.

The CATH Classification Strategy
Automatic procedure: “If a given domain has sufficiently high sequence and structural similarity (i.e. 35% sequence identity, SSAP score >= 80) with a domain that has been previously classified in CATH, the classification is automatically inherited from the other domain.”
“CATH – A hierarchic classification of protein structure domains” Orengo et al., Structure, 5, 1997.
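The automatic inheritance rule quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ClassifiedHit` record, the function name, the example CATH codes, and the tie-breaking rule (prefer the highest SSAP score) are all hypothetical and are not taken from the CATH pipeline itself. Only the two thresholds (35% identity, SSAP >= 80) come from the slide.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ClassifiedHit:
    """A comparison against a domain that already has a CATH classification."""
    cath_code: str        # classification of the known domain (hypothetical value)
    seq_identity: float   # percent sequence identity to the query domain
    ssap_score: float     # SSAP structural similarity score


def inherit_classification(hits: Iterable[ClassifiedHit],
                           min_identity: float = 35.0,
                           min_ssap: float = 80.0) -> Optional[str]:
    """Return the inherited CATH code, or None if no hit passes both thresholds."""
    qualifying = [h for h in hits
                  if h.seq_identity >= min_identity and h.ssap_score >= min_ssap]
    if not qualifying:
        return None  # falls through to the later, partly manual, steps
    # Assumption of this sketch: break ties by SSAP score, then by identity.
    best = max(qualifying, key=lambda h: (h.ssap_score, h.seq_identity))
    return best.cath_code


hits = [ClassifiedHit("3.20.20.70", 40.0, 85.0),
        ClassifiedHit("3.40.50.720", 20.0, 90.0)]  # fails the identity threshold
result = inherit_classification(hits)
print(result)
```

A domain with no qualifying hit returns `None`, which corresponds to passing it on to the profile-based and manual stages of the strategy.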

CATH Classification: Domain Assignment
Since the classification is performed on individual domains, the very first step is to assign domains (find domain boundaries).
- Use both automatic and manual techniques.
- If a chain has high sequence identity (80%) and structural similarity (SSAP score >= 80) with a protein chain X that has been classified in CATH, use the boundaries of X.
- Otherwise, apply several domain partition programs: CATHEDRAL, DETECTIVE (Swindells, 1995), PUU (Holm & Sander, 1994), DOMAK (Siddiqui and Barton, 1995).
  -- Consensus: assign automatically.
  -- No consensus: assign manually.
Two underlying problems: the domain partition problem, and the structure comparison problem (addressed by SSAP, the Sequential Structure Alignment Program).
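The consensus step can be pictured with a small Python sketch. The slide does not say how CATH decides that the partition programs agree, so the rule used here (same number of domains, and every boundary within a residue tolerance of the first program's answer) is an assumption made for illustration, as are the function names, the tolerance of 10 residues, and the example boundaries.

```python
def boundaries_agree(a, b, tolerance=10):
    """True if two partitions have the same domain count and similar boundaries."""
    if len(a) != len(b):
        return False
    return all(abs(s1 - s2) <= tolerance and abs(e1 - e2) <= tolerance
               for (s1, e1), (s2, e2) in zip(sorted(a), sorted(b)))


def assign_domains(predictions, tolerance=10):
    """predictions maps a program name to a list of (start, end) residue ranges.

    Returns ("automatic", boundaries) when every program agrees with the first
    one within the tolerance, otherwise ("manual", None).
    """
    results = list(predictions.values())
    if not results:
        return ("manual", None)
    reference = results[0]
    if all(boundaries_agree(reference, other, tolerance) for other in results[1:]):
        return ("automatic", sorted(reference))
    return ("manual", None)


# Hypothetical predictions for one chain.
agreeing = {"PUU": [(1, 100), (101, 200)], "DOMAK": [(1, 104), (105, 200)]}
disagreeing = dict(agreeing, DETECTIVE=[(1, 200)])

print(assign_domains(agreeing))
print(assign_domains(disagreeing))
```

Comparing everything against the first program keeps the sketch short; a stricter version would require all pairs of programs to agree.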

The CATH Hierarchy and Classification
Class, C-level
- Based on the secondary structure content of the domain
- There are four classes:
  1. mainly-alpha
  2. mainly-beta
  3. alpha-beta (a combination of α/β and α+β in SCOP)
  4. low secondary structure content

The CATH Hierarchy and Classification
Architecture, A-level
- This level describes the overall shape of the domain structure as determined by the orientations of the secondary structures, but ignores the connectivity between the secondary structures.
- It is assigned manually using a simple description of the secondary structure arrangement.

The CATH Hierarchy and Classification
Topology (Fold family), T-level
- Members of the fold family share the same overall shape and connectivity of the secondary structures in the domain core.
- Domains in the same fold group may have different structural decorations to the common core.

The CATH Hierarchy and Classification
Topology (Fold family), T-level
[Figure]

The CATH Hierarchy and Classification
Homologous Superfamily, H-level
- Protein domains in each H-group are thought to share a common ancestor and are homologous.
- Similarities are identified either by high sequence identity or by structure similarity using SSAP.
Domains are classified in the same homologous superfamily if they satisfy one of the following criteria:
1. Sequence identity >= 35%, with overlap >= 60% of the larger structure equivalent to the smaller.
2. SSAP score >= 80.0, sequence identity >= 20%, and 60% of the larger structure equivalent to the smaller.
3. SSAP score >= 70.0, 60% of the larger structure equivalent to the smaller, and domains with related functions, as informed by the literature and the Pfam protein family database (Bateman et al., 2004).
4. Significant similarity from HMM-sequence searches and HMM-HMM comparisons using SAM (Hughey & Krogh, 1996), HMMER and PRC.
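The four H-level criteria read naturally as a decision function. The sketch below encodes them as listed on the slide; the function and parameter names are hypothetical, and criteria 3 and 4 depend on judgments (related function, significant HMM similarity) that are represented here simply as boolean inputs rather than computed.

```python
def same_homologous_superfamily(seq_identity: float,
                                overlap: float,
                                ssap_score: float = 0.0,
                                related_function: bool = False,
                                hmm_match: bool = False) -> bool:
    """Apply the H-level criteria to one pair of domains.

    seq_identity     percent sequence identity
    overlap          percent of the larger structure equivalent to the smaller
    ssap_score       SSAP structural similarity score
    related_function True if literature or Pfam indicates related functions
    hmm_match        True if HMM-sequence or HMM-HMM searches found a significant hit
    """
    if overlap >= 60.0:
        if seq_identity >= 35.0:                            # criterion 1
            return True
        if ssap_score >= 80.0 and seq_identity >= 20.0:     # criterion 2
            return True
        if ssap_score >= 70.0 and related_function:         # criterion 3
            return True
    return hmm_match                                        # criterion 4


print(same_homologous_superfamily(40.0, 70.0))                      # criterion 1
print(same_homologous_superfamily(25.0, 70.0, ssap_score=82.0))     # criterion 2
print(same_homologous_superfamily(10.0, 70.0, ssap_score=75.0))     # none satisfied
```

Note that the slide states no overlap requirement for criterion 4, so the sketch applies none.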

The CATH Hierarchy and Classification
Sequence Family Levels: (S,O,L,I,D)
Domains within each H-level are subclustered into sequence families using multi-linkage clustering at the S, O, L and I levels.
Note: The D-level is assigned as a counter within each S100 family to ensure that each domain in CATH has a unique CATHSOLID classification.
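As a small illustration of what a unique CATHSOLID classification looks like as data, the sketch below splits a nine-part, dot-separated identifier into its levels. The identifier shown is a made-up example, the function name is hypothetical, and the sequence identity cut-offs used for the S, O, L and I levels are deliberately not encoded because they are not given in the text above.

```python
LEVELS = "CATHSOLID"  # one letter per level, in hierarchy order


def parse_cathsolid(identifier: str) -> dict:
    """Split a dot-separated CATHSOLID identifier into a level -> number mapping."""
    parts = identifier.split(".")
    if len(parts) != len(LEVELS):
        raise ValueError(f"expected {len(LEVELS)} fields, got {len(parts)}")
    return dict(zip(LEVELS, (int(p) for p in parts)))


parsed = parse_cathsolid("3.20.20.70.1.2.1.1.3")   # hypothetical identifier
superfamily = ".".join(str(parsed[level]) for level in "CATH")

print(parsed)
print(superfamily)
```

The first four fields identify the homologous superfamily; the last field is the D counter that makes the full identifier unique.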

Not completely manual: SCOP Workflow Andreeva, et al, NAR, 2008

SCOP: Structural Classification of Proteins
Family: Clear evolutionary relationship
(1) Pairwise residue identities between the proteins are 30% and greater.
(2) Proteins with low sequence similarity but very similar functions and structures; for example, many globins have sequence identities of only 15%.
Superfamily: Probable common evolutionary origin
Proteins that have low sequence identities, but whose structural and functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies. For example, actin, the ATPase domain of the heat shock protein, and hexokinase together form a superfamily.
Fold: Major structural similarity
(1) Proteins have the same major secondary structures in the same arrangement and with the same topological connections.
(2) Proteins placed together in the same fold category may not have a common evolutionary origin: the structural similarities could arise just from the physics and chemistry of proteins favoring certain packing arrangements and chain topologies.
Class: Secondary structure content and organization
Murzin et al., J. Mol. Biol. 247, 1995

SCOP Parseable Files: very useful!
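To show why the parseable files are convenient, here is a minimal Python sketch that reads one record in the layout of the SCOP classification file (`dir.cla.scop.txt`): six whitespace-separated columns holding the domain identifier, PDB code, chain region, the sccs string (class.fold.superfamily.family), a numeric identifier, and a comma-separated list of hierarchy node identifiers. The column layout is assumed from the SCOP 1.75 distribution and should be checked against the file header; the function name and the returned dictionary keys are hypothetical.

```python
def parse_cla_line(line: str):
    """Parse one record of a SCOP classification file; return None for comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    sid, pdb_id, region, sccs, sunid, hierarchy = line.split()
    cls, fold, superfamily, family = sccs.split(".")
    return {
        "sid": sid,
        "pdb_id": pdb_id,
        "region": region,
        "class": cls,
        "fold": f"{cls}.{fold}",
        "superfamily": f"{cls}.{fold}.{superfamily}",
        "family": sccs,
        "sunid": int(sunid),
        # e.g. {"cl": 46456, "cf": 46457, ...}: node identifiers per level
        "nodes": {key: int(value)
                  for key, value in (item.split("=") for item in hierarchy.split(","))},
    }


# Illustrative record in the assumed layout.
example = ("d1dlwa_\t1dlw\tA:\ta.1.1.1\t14982\t"
           "cl=46456,cf=46457,sf=46458,fa=46459,dm=46460,sp=46461,px=14982")
record = parse_cla_line(example)
print(record["superfamily"], record["nodes"]["sf"])
```

With records in this form, grouping domains by fold or superfamily becomes a dictionary lookup.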

Common Folds
Immunoglobulin fold
- All-β protein sandwich
- Consists of 2 layers: ~7 antiparallel β-strands arranged in two β-sheets
- 28 SCOP superfamilies
TIM barrel fold
- α/β protein fold
- Named after the glycolytic enzyme triosephosphate isomerase
- Eight α-helices and eight parallel β-strands
- 33 SCOP superfamilies
Rossmann fold
- α/β protein fold
- Named after Michael Rossmann
- Parallel β-strands connected by α-helices
- 12 SCOP families

Common Folds “CATH – A hierarchic classification of protein structure domains” Orengo et al., Structure, 5, 1997.

Immunoglobulin-like beta-sandwich: antibody domains (VL, CL, VH, CH1, CH2, CH3); CuZn superoxide dismutase

Red = Rossmann fold domain within the enzyme alcohol dehydrogenase

Rossmann fold; TIM-barrel fold

TIM-barrel 33 different superfamilies that share the same fold

Aldolase Enolase

TIM-barrel

SCOP CATH

SCOP vs. CATH
The number of domains into which each chain is separated in SCOP and CATH is compared. For the most part, the two classification schemes agree on the number of domains per chain (5681 of 6875 chains is ∼82% agreement). However, in the case of chains split into two domains in CATH, almost half are considered as only one domain within SCOP.
“A systematic comparison of protein structure classifications: SCOP, CATH and FSSP” Caroline Hadley and David T Jones, Structure, 7(9), 1999

SCOP vs. CATH
SCOP: small protein; CATH: mainly α
- CATH ignores the presence of small β strands in the lysozyme superfamily and considers the protein mainly α.
- SCOP takes into account the functional and evolutionary importance of these strands, and classifies the lysozymes as α+β.
“A systematic comparison of protein structure classifications: SCOP, CATH and FSSP” Caroline Hadley and David T Jones, Structure, 7(9), 1999

The Russian doll effect
The recurrence of common motifs within many of the superfolds and major architectures gives rise to an overlap of structures in these regions of fold space. This means that it becomes harder to distinguish between structural families for these architectures and it is perhaps more appropriate to consider a continuum of protein folds. This is particularly apparent in the layer-based sandwich architectures of the mainly β and α−β classes. For example, within the α−β three-layer doubly wound architectures, it is possible to generate a very large family of structures. Each new structure added to a family will be related to the last by a simple extension of one or more βαβ motifs and the structures are then embedded within each other in a ‘Russian doll’ like effect.
“CATH – A hierarchic classification of protein structure domains” Orengo et al., Structure, 5, 1997.

Meaning, the designations implied below aren’t so definitive

Structural redundancy
(563) (423) (283)
AN ASIDE: Commonly, SCOP/CATH classifications are used to remove structural redundancy from a dataset. For example, the plots above are from a paper that my lab published characterizing a catalytic site prediction algorithm.
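One common way to use a classification for redundancy removal is to keep a single representative per homologous superfamily. The sketch below shows that idea only; it is not the procedure from the paper mentioned above, and the domain identifiers, CATH codes, and the "first one seen wins" rule are assumptions made for illustration.

```python
def one_per_superfamily(domains):
    """Keep the first domain seen for each CATH superfamily (C.A.T.H).

    domains is an iterable of (domain_id, cath_code) pairs; the code may carry
    more than four levels, only the first four are used.
    """
    representatives = {}
    for domain_id, cath_code in domains:
        superfamily = ".".join(cath_code.split(".")[:4])
        representatives.setdefault(superfamily, domain_id)
    return representatives


# Hypothetical dataset: two domains share a superfamily, so one is dropped.
dataset = [("domA", "3.20.20.70"),
           ("domB", "3.20.20.70.1.1"),
           ("domC", "3.40.50.720")]
nonredundant = one_per_superfamily(dataset)
print(nonredundant)
```

Choosing the level at which to cut (topology, superfamily, or a sequence family level) trades dataset size against how much structural similarity is tolerated.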

Other ways of classifying proteins: EC #
The EC (Enzyme Commission) system creates a controlled vocabulary vis-à-vis enzyme function.
Example: Ribose-phosphate diphosphokinase
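An EC number is itself a small hierarchy: four dot-separated fields giving the enzyme class, subclass, sub-subclass, and a serial number. The sketch below splits a number into those levels. The function name and dictionary keys are hypothetical, and the example numbers (2.7.1.1 for hexokinase, 3.1.3.9 for glucose-6-phosphatase) are taken from the standard enzyme nomenclature rather than from these slides.

```python
# Top-level EC classes in the standard nomenclature.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
    7: "Translocases",
}


def parse_ec(ec_number: str) -> dict:
    """Split an EC number such as '2.7.1.1' into its four hierarchical levels."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four fields, got {len(parts)}")
    return {
        "class_name": EC_CLASSES[int(parts[0])],
        "subclass": ".".join(parts[:2]),
        "sub_subclass": ".".join(parts[:3]),
        "serial": parts[3],  # kept as text: incomplete numbers may use '-'
    }


hexokinase = parse_ec("2.7.1.1")
g6pase = parse_ec("3.1.3.9")
print(hexokinase["class_name"], g6pase["class_name"])
```

Two enzymes acting on the same metabolite can therefore sit in entirely different EC classes, because the number describes the reaction catalyzed rather than the substrate or the structure.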

G6P Phosphatase vs. Hexokinase Glucose-6-Phosphate Phosphatase EC EC

Other ways of classifying proteins: EC #

From a former QE question
Given the following information about two proteins, Protein A and B, what can you tell about the structural and functional similarity/dissimilarity of the two proteins? Comment on the evolution of the two proteins.
Table columns: EC Number, CATH Classification. Rows: Protein A, Protein B.
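One way to approach a question like this is to count how many leading levels two hierarchical codes share, once for the EC numbers (function) and once for the CATH codes (structure). The sketch below does that for made-up inputs; the values are not those from the exam question, and the function name is hypothetical.

```python
def shared_levels(code_a: str, code_b: str) -> int:
    """Count how many leading dot-separated levels two codes have in common."""
    count = 0
    for level_a, level_b in zip(code_a.split("."), code_b.split(".")):
        if level_a != level_b:
            break
        count += 1
    return count


# Made-up inputs for illustration only.
protein_a = {"ec": "1.1.1.1", "cath": "3.40.50.720"}
protein_b = {"ec": "1.1.1.27", "cath": "3.40.50.720"}

ec_depth = shared_levels(protein_a["ec"], protein_b["ec"])
cath_depth = shared_levels(protein_a["cath"], protein_b["cath"])
print(ec_depth, cath_depth)
```

Under this reading, four shared CATH levels would place both domains in the same homologous superfamily (suggesting common ancestry), while the number of shared EC levels indicates how far their catalytic functions have diverged. A pair sharing few CATH levels but all four EC levels would instead point toward similar function on unrelated structural scaffolds.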

Other ways of classifying proteins: GO
From Wikipedia: Gene ontology, or GO, is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to:
1. Maintain and develop its controlled vocabulary of gene and gene product attributes;
2. Annotate genes and gene products, and assimilate and disseminate annotation data;
3. Provide tools for easy access to all aspects of the data provided by the project, and to enable functional interpretation of experimental data using the GO, for example via enrichment analysis.
GO is part of a larger classification effort, the Open Biomedical Ontologies (OBO).

Other ways of classifying proteins: KEGG
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.