Topic 10 Chapter 20, Du and Bourne “Structural Bioinformatics”

What is a domain?
A reasonable region of complexity.
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Protein Domain
- The definition of a protein domain is not well defined (to say the least), which makes it difficult to identify domain boundaries.
- General considerations:
  - compact, semi-independent units (close to spherical shape) *
  - interactions between domains are weak (small contact)
  - identifiable hydrophobic core (interface is more hydrophilic) **
  - β-sheet is best preserved
* Wetlaufer DB. PNAS 1973; 70:697-701
** Swindells MB. Protein Science 1995; 4:103-112

Multi-domain Proteins
Approximately 50% of proteins are multi-domain (data from 2005). The fraction could be as high as 80% in eukaryotes.
Redfern et al, PLoS Computational Biology, 2007

From Wikipedia… A protein domain is a part of protein sequence and structure that can evolve, function, and exist independently (not likely now) of the rest of the protein chain. Each domain forms (formed?) a compact three-dimensional structure and often can be independently stable and folded. Many proteins consist of several structural domains. One domain may appear in a variety of different proteins. Molecular evolution uses domains as building blocks and these may be recombined in different arrangements to create proteins with different functions. Domains vary in length from between about 25 amino acids up to 500 amino acids in length. The shortest domains such as zinc fingers are stabilized by metal ions or disulfide bridges. (Is a single zinc finger really a domain?) Domains often form functional units, such as the calcium-binding EF-hand domain of calmodulin. (Is a single EF-hand really a domain?) Because they are independently stable, domains can be "swapped" by genetic engineering between one protein and another to make chimeric proteins. (Sometimes.)

EF-hands (domain or motif?) Calmodulin The EF-hand is another common structural element. In fact, the protein calmodulin has four of them.

What about a zinc finger? Zinc finger From Wikipedia: A zinc finger is a small protein structural motif that is characterized by the coordination of one or more zinc ions in order to stabilize the fold.

Quick aside about zinc fingers

Repeat proteins Ankyrin

Adding to the Complexity: Discontinuous Domains
About 20% of multidomain proteins are not contiguous in sequence.
SCOP classification of 1cg2 (figure labels: N-terminal, C-terminal):
- 33844 px, c.56.5.4, d1cg2a1, 1cg2 A:26-213,A:327-414
- 39360 px, d.58.19.1, d1cg2a2, 1cg2 A:214-326
Redfern OC. et al, PLoS Computational Biology, 2007

Domain identification
Any structure left unclassified by the sequence-based methods is divided into its constituent domains (when appropriate). The domains are then resubmitted to the sequence and structure comparison protocols discussed previously.
While there are many automatic domain identification algorithms, most produce a significant number of incorrect assignments (20-30% incorrect). This is mainly because there is no unique answer to the question, “What is a domain?” For example, one could easily envision various domain classification schemes based on sequence, phylogeny, and/or structure.
Structure-based approaches rest on straightforward structural concepts: namely, that (globular) proteins have hydrophobic cores, and that these cores should constitute a (semi)independent folding nucleus. Thus the automated methods attempt to (maximize, minimize) (intra, inter)-domain contacts.
What about non-globular (i.e., intrinsically disordered or integral) proteins?

Domain identification Most automated domain identification methods are primarily based on this premise. However, as you might expect, there are myriad ways to implement such an idea. ADH

Automatic Domain Partition Methods
- Early works apply only to single-segment domains: Crippen, 1978; Nemethy & Scheraga, 1979; Lesk & Rose, 1981; Rashin, 1981.
- Current methods for multi-segment domains mostly use heuristics and approximations: Holm & Sander, 1994; Siddiqui & Barton, 1995; Swindells, 1995...
Note: the focus here is structural domain partition. While structure-based domain assignment is not a trivial problem, domain prediction from sequence is even more difficult. Any advance in sequence-based domain prediction will greatly improve protein structure prediction.

The general approach Basic principle for domain partition: inter-residue interactions are denser within domains than between domains
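The basic principle can be checked numerically on any proposed partition. The following is a minimal sketch, not any published method: the function name, the 8 Å cutoff, and the toy coordinates are all assumptions made for illustration.

```python
from itertools import combinations
from math import dist

def contact_counts(coords, labels, cutoff=8.0):
    """Count residue-residue contacts within and between domains.

    coords: one (x, y, z) point per residue (e.g. C-alpha positions)
    labels: one domain label per residue
    cutoff: contact distance in angstroms (assumed value, not from the slides)
    """
    intra = inter = 0
    for i, j in combinations(range(len(coords)), 2):
        if dist(coords[i], coords[j]) <= cutoff:
            if labels[i] == labels[j]:
                intra += 1
            else:
                inter += 1
    return intra, inter

# Toy geometry: two tight clusters of three points with a narrow interface.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0),
          (8.5, 0, 0), (9.5, 0, 0), (8.5, 1, 0)]
labels = ["A", "A", "A", "B", "B", "B"]
intra, inter = contact_counts(coords, labels)
print(intra, inter)
```

A good partition gives many intra-domain contacts and few inter-domain contacts; a poor labeling of the same points shifts contacts from the first count to the second.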

Top-down vs. Bottom-up
- Top-down: start with the entire structure and proceed through iterative partitions into smaller units.
- Bottom-up: define very small structural units and assemble them into domains.
Over the years, an amazing array of approaches has been put forward to solve the domain ID problem. In spite of very different overall approaches, an interesting observation has been made: most algorithms correctly ID 70-80% of domains within structures, but fail on the others because of the complexity of some multi-domain proteins. The number of boundaries is either over-predicted, leading to too many domains (overcut), or under-predicted, leading to too few domains (undercut). Thus, the remaining problem is not “where does the boundary of the domain fall?”, but rather “is the identified boundary real?”

How do automatic methods work?
Input: 3D coordinates of the chain. Output: predicted domains.
Step 1:
- Top-down approach: make domains by partitioning the chain into smaller units.
- Bottom-up approach: make domains by putting together primitive units of secondary structure.
Step 2: evaluate each potential domain using a set of parameters (accept or reject the given assignment).
Parameters involved:
- Maximize hydrophobic core of the unit
- Maximize compactness of the unit
- Find mechanical hinge points between units
- Minimize interface area between units
- Minimum size of unit
- Maximize globularity
- Minimize cutting through secondary structures
- Maximum number of discontinuous fragments within the domain
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Two steps of algorithm design:
Step A: Train the algorithm
- Compare predicted domain assignments to “correct” domain assignments
- Tune parameters until the best level of prediction is achieved
Step B: Validate the performance
- Run the algorithm on an independent set of data
- Report the % of correctly partitioned proteins
Use expert data for domain assignments.
A problem: different algorithms use assignments from different experts for training and validation. Algorithms will reflect the same propensities toward domain assignments as the expert method they rely upon. More seriously, there is no good objective way to compare the performance of different methods, as each uses a different dataset for validation.
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

Issues in Protein Domain Partition
- Compactness (contacts/# of residues...)
- Minimum domain size (35 amino acids [AA], 40 AA...?)
- Minimum size to be considered for partition (80 AA...?)
- Integrity of secondary structures (is it OK to break a β-sheet?)
- Most programs use a top-down approach; what are the stopping criteria?

CATH Domain Classification  Use both automatic and manual techniques  If it has high sequence identity (80%) and structural similarity (SSAP score >= 80) with a protein chain X that has been classified in CATH, use the boundaries of X. Otherwise, apply several domain partition programs 1. DETECTIVE (Swindells, 1995), 2. PUU (Holm & Sander, 1994), 3. DOMAK (Siddiqui and Barton, 1995). If there is no consensus  assign manually.

Differences
WARNING: Even though each method has about 70-80% accuracy in benchmark tests, the methods disagree substantially with one another, both in the number of domains and in the domain boundaries. In CATH, if consensus is not found within a tolerance of 10 residues, the domains are manually assigned (right).
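A consensus test with a 10-residue tolerance might look like the sketch below. The slides do not say how CATH compares boundaries in detail, so the rule used here (same number of boundaries, and each corresponding boundary within the tolerance across all methods) and the function name are assumptions.

```python
def boundaries_agree(predictions, tolerance=10):
    """Return True if all methods place the same number of domain
    boundaries and corresponding boundaries differ by <= tolerance.

    predictions: one list of boundary residue positions per method
    """
    ordered = [sorted(p) for p in predictions]
    if len({len(p) for p in ordered}) != 1:
        return False  # methods disagree on the number of domains
    for positions in zip(*ordered):
        if max(positions) - min(positions) > tolerance:
            return False  # this boundary is placed too differently
    return True

# Hypothetical boundary calls from three programs for one chain.
print(boundaries_agree([[120], [125], [118]]))       # within tolerance
print(boundaries_agree([[120], [140], [118]]))       # boundary too far off
print(boundaries_agree([[120], [120, 250], [118]]))  # different domain counts
```

Chains for which this returns False would be the ones passed on for manual assignment.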

Automatic Domain Partition Methods
- DOMAK (Siddiqui and Barton, 1995): split value = (intA/extAB) * (intB/extAB), where intA (intB) is the number of internal contacts in A (B) and extAB is the number of contacts between A and B (contact: heavy atoms within 5 Å)
- DETECTIVE (Swindells, 1995): hydrophobic core determination
- PUU (protein unfolding units; Holm & Sander, 1994): harmonic model to describe inter-domain dynamics
- DomainParser (Xu, 2000): graph algorithm (network flow)
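The DOMAK split value is easy to compute once contacts are known. The sketch below evaluates it for every single cut position along a chain; the scan, the `min_size` parameter, and the toy contact list are illustrative assumptions, not the published DOMAK procedure.

```python
def split_value(int_a, int_b, ext_ab):
    """Split value = (intA/extAB) * (intB/extAB)."""
    if ext_ab == 0:
        return float("inf")  # no contacts across the cut at all
    return (int_a / ext_ab) * (int_b / ext_ab)

def best_single_cut(contacts, n_residues, min_size=2):
    """Try each cut position k (A = residues < k, B = residues >= k)
    and return (k, split value) for the highest-scoring cut.

    contacts: list of (i, j) residue index pairs that are in contact
    min_size: smallest allowed segment (assumed toy value)
    """
    best = None
    for k in range(min_size, n_residues - min_size + 1):
        int_a = sum(1 for i, j in contacts if i < k and j < k)
        int_b = sum(1 for i, j in contacts if i >= k and j >= k)
        ext_ab = sum(1 for i, j in contacts if (i < k) != (j < k))
        score = split_value(int_a, int_b, ext_ab)
        if best is None or score > best[1]:
            best = (k, score)
    return best

# Toy chain of 6 residues: two triangles of contacts joined by one contact.
contacts = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(best_single_cut(contacts, 6))  # (3, 9.0)
```

A high split value means many internal contacts on both sides and few contacts across the cut, which is the behavior expected at a real domain boundary.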

DomainParser
DomainParser (Xu et al, Bioinformatics 2000) uses a graph-theoretic algorithm to decompose a multi-domain protein into individual structural domains. The underlying principle is that residue-residue contacts are denser within a domain than between domains. The decomposition problem is recast as a network flow problem, in which each residue is represented as a node of a network and each residue-residue contact is represented as an edge with a particular capacity, depending on the type of contact. A two-domain decomposition problem is solved by finding a cut of the network that minimizes the total cross-edge capacity (minimum cut). To deal with networks with non-unique minimum cuts, the algorithm finds all cuts that achieve the minimum cross-edge capacity.
A recent analysis of four automatic methods put DomainParser (marginally) at the top (Holland et al, JMB, 2006). In fact, three of the four were nearly equal, depending on the evaluation criterion.

Domain Partition as a Network Flow Problem
Basic idea: identify the bottleneck.
Figure labels: domain partition, interface, network flow, bottleneck.
Xu et al, Bioinformatics 2000; Guo et al, NAR 2003
Note: there is now a DomainParser 2

DomainParser
Domain identification is recast as a network flow problem. That is, the method attempts to divide the network into two interconnected parts in such a way that the edge capacity across the division is minimized. (Note: each edge can carry a different weight, or capacity.) Intuitively, this translates into finding the bottleneck within the network. The algorithm works by systematically removing nodes until domain separation is maximized.
There is a second (post-processing) step that checks the validity of the domain boundaries using commonsense metrics like compactness, radius of gyration, number of non-contiguous segments per domain, and distribution of domain sizes.
Because the method is based on topology, it is very fast. It also scales well: O(nm^2), where n = # of nodes and m = # of edges.
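As one example of a post-processing metric, the radius of gyration of a candidate domain can be computed directly from its coordinates. This is a minimal, unweighted sketch (every point counts equally); the thresholds DomainParser actually applies are not given in the slides and are not reproduced here.

```python
from math import sqrt

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                  for x, y, z in coords) / n
    return sqrt(mean_sq)

# Toy comparison: four points in a square vs. four points in a line.
compact = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
extended = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(radius_of_gyration(compact) < radius_of_gyration(extended))  # True
```

For the same number of residues, a smaller radius of gyration indicates a more compact, more globular candidate domain.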

Maximum Flow/Minimum Cut (bottleneck)
Figure labels: source, sink, node, edge, capacity.
Algorithm to solve this problem: Ford-Fulkerson method.
We need to construct a graph first...
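To make the maximum flow/minimum cut idea concrete, here is a small sketch of the Ford-Fulkerson method using breadth-first search for augmenting paths (the Edmonds-Karp variant). It is a generic textbook implementation on a toy graph, not DomainParser's code; the graph and function name are assumptions.

```python
from collections import defaultdict, deque

def min_cut(edges, source, sink):
    """Return (max flow value, set of nodes on the source side of a min cut).

    edges: iterable of (u, v, capacity) for an undirected graph
    """
    residual = defaultdict(lambda: defaultdict(int))
    for u, v, cap in edges:
        residual[u][v] += cap
        residual[v][u] += cap  # undirected: capacity in both directions

    def reachable_from_source():
        # BFS over edges that still have residual capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if sink not in parent:
            # No augmenting path left: reachable nodes form the source side.
            return flow, set(parent)
        # Find the smallest residual capacity along the path to the sink.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push that much flow along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

# Toy network: two densely connected triangles joined by one weak edge.
edges = [(0, 1, 5), (0, 2, 5), (1, 2, 5),
         (3, 4, 5), (3, 5, 5), (4, 5, 5),
         (2, 3, 1)]
flow, source_side = min_cut(edges, source=0, sink=5)
print(flow, sorted(source_side))  # 1 [0, 1, 2]
```

The weak edge between the two triangles is the bottleneck: the maximum flow equals its capacity, and the nodes still reachable from the source define one "domain".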

Model Building for Domain Partition
- Node: residue (Cα)
- Capacity: packing
- Source/sink: extreme points
Find the bottleneck.
Issues:
- Compactness
- Minimum domain size
- Integrity of secondary structures
- When to stop

Capacity and Extreme Points
Capacity between residues A and B (based on Holm & Sander 1994):
(1) if atom distance <= 4.0 Å, +1;
(2) if backbone contact, +5;
(3) if across a β-sheet, +12;
(4) if within a β-strand, +1000.
These weights preserve β-sheet structure.
Source and sink: the two farthest residues perpendicular to the axis (sampling). Use multiple extreme points.
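The capacity rules can be read as a simple additive score. In the sketch below, rule (1) is assumed to add 1 per atom pair within 4.0 Å, which the slide does not state explicitly; the function and argument names are hypothetical, and deciding which flags apply to a residue pair is left to the caller.

```python
def edge_capacity(close_atom_pairs, backbone_contact=False,
                  across_sheet=False, same_strand=False):
    """Additive edge capacity between two residues.

    close_atom_pairs: number of atom pairs within 4.0 angstroms
                      (assumed reading of rule 1)
    """
    capacity = close_atom_pairs      # rule 1: +1 per close atom pair
    if backbone_contact:
        capacity += 5                # rule 2
    if across_sheet:
        capacity += 12               # rule 3
    if same_strand:
        capacity += 1000             # rule 4
    return capacity

print(edge_capacity(3, backbone_contact=True))  # 8
print(edge_capacity(1, same_strand=True))       # 1001
```

Because a minimum cut avoids high-capacity edges, the very large weight for residues in the same β-strand makes it effectively impossible for a cut to pass through a strand.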

Assignments by DomainParser vs. SCOP: undercut
- Domains have very simple and/or extended structure and violate the compact globular requirement (DomainParser: 1 domain)
- DomainParser preserves β-sheet (DomainParser: 1 domain)
Examples shown: 1aaya, 1zmec, 6prch

Assignments by DomainParser vs. SCOP: overcut
- Structurally correct decomposition by DomainParser (DomainParser: 2 domains)
- SCOP treats them as single-domain proteins (functional consideration, or ?)
Examples shown: 2liv, 2adma

Holland, et al, JMB, 2006 Experts: CATH, SCOP, AUTHORS Domain Assignments by DomainParser

DomainParser tends to undercut large multi-domain proteins Holland, et al, JMB, 2006

Summary of Performance Comparison Holland, et al, JMB, 2006

But PDP (Protein Domain Parser) is the winner Holland, et al, JMB, 2006

But PDP (Protein Domain Parser) is the winner
PDP is a recursive top-down algorithm that makes either (1) a single cut producing two contiguous domains, or (2) a double cut, where the cuts are at least 35 residues apart and within 8 Å of each other. The best cut is selected using the criterion of minimum contacts between the resulting domains, normalized by the size of the domains. The algorithm continues to recursively partition each of the resulting domains until a stopping condition is met.
During the post-processing step, the number of contacts between resulting domains is evaluated and domains with a high level of contacts are merged together. Very small domains (below 35 residues) are discarded.
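The double-cut constraint can be sketched as a search for residue pairs that are far apart in sequence but close in space. This is only an illustration of that one constraint, assuming "within 8 Å" refers to the distance between the two cut-point residues; the function name and the toy hairpin geometry (1 Å spacing, not realistic for a protein) are assumptions, and PDP's scoring of the resulting cuts is not shown.

```python
from math import dist

def candidate_double_cuts(ca_coords, min_separation=35, max_distance=8.0):
    """List residue index pairs (i, j) eligible as a double cut:
    at least min_separation apart in sequence and at most
    max_distance apart in space."""
    n = len(ca_coords)
    cuts = []
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist(ca_coords[i], ca_coords[j]) <= max_distance:
                cuts.append((i, j))
    return cuts

# Toy chain: 40 residues out along x, then 40 residues back at y = 5.
chain = [(float(i), 0.0, 0.0) for i in range(40)]
chain += [(float(39 - k), 5.0, 0.0) for k in range(40)]
cuts = candidate_double_cuts(chain)
print((10, 69) in cuts)  # True: 59 residues apart, 5 A apart in space
print((30, 49) in cuts)  # False: only 19 residues apart in sequence
```

A double cut at (i, j) would place residues i..j in one domain and the two flanking segments together in another, which is how a discontinuous domain can be produced.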

Summary of Performance Comparison
Based on the criterion of correct number of assigned domains, PDP appears to be the most accurate method (85% correct), followed by NCBI (83%), DomainParser (77%), and PUU (74%). DomainParser is the most accurate on structures with few domains. However, it tends to under-cut many structures (4.5% over-cut, 18.5% under-cut). NCBI, on the other hand, shows a balance between over-cut and under-cut types of errors (9.9% over-cut, 7.6% under-cut).
The performance of PDP is consistently superior to that of the other methods; it is particularly impressive on chains with a larger number of domains: the method correctly assigns four of the five five-domain chains and is the only method to correctly assign a six-domain chain. In general, the performance of NCBI is very similar to that of PDP, both overall and in its profile; its assignment of four-domain chains is superior to that of PDP, but NCBI fails to correctly assign most of the five-domain chains and both of the six-domain chains.
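The "correct number of domains" criterion, together with the over-cut and under-cut rates, can be computed from paired domain counts. The sketch below uses made-up counts purely to show the arithmetic; it does not reproduce the Holland et al. benchmark data.

```python
def cut_error_rates(predicted, reference):
    """Fractions of chains with the correct, too many (overcut),
    or too few (undercut) domains compared with the reference."""
    n = len(reference)
    over = sum(p > r for p, r in zip(predicted, reference))
    under = sum(p < r for p, r in zip(predicted, reference))
    return {
        "correct": (n - over - under) / n,
        "overcut": over / n,
        "undercut": under / n,
    }

# Hypothetical domain counts for five chains (method vs. expert).
predicted = [1, 2, 2, 3, 1]
reference = [1, 2, 3, 2, 2]
print(cut_error_rates(predicted, reference))
# {'correct': 0.4, 'overcut': 0.2, 'undercut': 0.4}
```

Note that this criterion only checks the number of domains; two methods can agree on the count and still place the boundaries differently.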

Some insights from looking at automatic domain assignments:
- Maximizing the ratio of intra-/inter-domain contacts is the chief principle in algorithmic assignments and works well for ‘standard’ cases.
- As more complex structures are solved, more cases of ‘unusual’ architecture are uncovered. These tend to defy our basic rules.
- It is possible to include more parameters and tune them better to avoid some obvious cases of overcuts:
  - penalize splitting secondary structure elements (some cutting of secondary structures is essential to obtain the ‘correct’ domain, but this feature should be carefully balanced)
  - penalize domains consisting of too many short fragments (excessive fragmentation may result in very compact, but biologically unfeasible domains)
  - improve the ability to recognize ‘classical’ folds (this will improve recognition of very small and very large domains for which contact density may be misleading)
Shamelessly ‘borrowed’ from Phil Bourne’s notes: www.sdsc.edu/pb/edu/pharm201/15/15.pptx

http://pdomains.sdsc.edu Best practices: use a consensus approach
