Protein Tertiary Structure Comparison Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia.

Protein Tertiary Structure Comparison Dong Xu Computer Science Department 271C Life Sciences Center 1201 East Rollins Road University of Missouri-Columbia Columbia, MO (O)

Lecture Outline • Why structural alignment • Technical definition • SSAP • DALI • Fast search • Protein families

Structure Is Better Conserved during Evolution A structure can accommodate a wide range of mutations. Physical forces favor certain structures. Concept of fold: the number of folds is limited (currently ~1,000 known; estimated total in the 1,000s to 10,000s). Example: TIM barrel

Alignment of Protein Structure • The three-dimensional structure of one protein is compared against the three-dimensional structure of a second protein • Atoms (protein backbones) are fitted together as closely as possible to minimize the average deviation

Why Align Structures? (1) • Additional measure of protein similarity • Structure is generally preserved better than sequence over the course of evolution • Provides more information on the relationship between proteins than sequence alignment can offer • Allows classification of proteins based on structural similarities

Why Align Structures? (2) • Basis for protein fold identification (prediction) • Sometimes sequence similarity between two proteins exists but is not strong enough to produce an unambiguous alignment; the structural alignment then serves as the gold standard for the sequence comparison. • Pinpoints the active sites more accurately. • Allows identification of common substructures of interest

Why Align Structures? (3) Illustrate features of protein family: Evolution of the globin family

Why Align Structures? (4) Illustrate interesting evolutionary/functional relationships between proteins: two ferredoxins, 1DOI and 1AWD, are aligned structurally, showing an insertion in 1DOI that contains potassium-ion binding sites. This may be the result of adaptation to the high-salt environment of the Dead Sea.

Lecture Outline • Why structural alignment • Technical definition • SSAP • DALI • Fast search • Protein families

Structure alignment Simple case: two closely related proteins with the same number of amino acids. Find a transformation T that achieves the best superposition.

Transformations • Translation • Translation and rotation: rigid motion (Euclidean space)
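A rigid motion changes where a structure sits in space but not its internal geometry. The following is a minimal sketch, assuming coordinates are plain 3-tuples and restricting the rotation to the z-axis for brevity; the function names are illustrative and not part of the lecture.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3-D point about the z-axis by `angle` radians."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, shift):
    """Shift a 3-D point by the vector `shift`."""
    return tuple(p + t for p, t in zip(point, shift))

def rigid_motion(points, angle, shift):
    """Apply a rotation about z followed by a translation to every point."""
    return [translate(rotate_z(p, angle), shift) for p in points]

# Three points spaced 3.8 apart, an assumed toy backbone fragment.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
moved = rigid_motion(coords, math.pi / 2, (1.0, 2.0, 3.0))

# Inter-point distances are unchanged by the rigid motion.
before = math.dist(coords[0], coords[2])
after = math.dist(moved[0], moved[2])
```

Because all pairwise distances are preserved, two copies of the same structure related by a rigid motion can always be superimposed exactly.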

Types of Structure Comparison • Sequence-dependent vs. sequence-independent structural alignment • Global vs. local structural alignment • Pairwise vs. multiple structural alignment

Sequence-dependent Structure Comparison (1) Given two sets of 3-D points $P=\{p_i\}$, $Q=\{q_i\}$, $i=1,\dots,n$, the root mean square deviation is $\mathrm{rmsd}(P,Q) = \sqrt{\sum_i |p_i - q_i|^2 / n}$. Find a 3-D rigid transformation $T^*$ such that $\mathrm{rmsd}(T^*(P), Q) = \min_T \sqrt{\sum_i |T(p_i) - q_i|^2 / n}$.
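The rmsd definition translates directly into code. The sketch below, under the assumption that both point sets are equal-length lists of 3-tuples already in correspondence, computes the rmsd and removes only the translational part of the misfit by moving both centroids to the origin. Finding the optimal rotation as well (for example with the Kabsch algorithm) needs a singular value decomposition and is not shown here.

```python
import math

def rmsd(P, Q):
    """Root mean square deviation between corresponding points of P and Q."""
    if len(P) != len(Q) or not P:
        raise ValueError("P and Q must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(P, Q))
    return math.sqrt(total / len(P))

def centroid(points):
    """Arithmetic mean of a list of 3-D points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def center(points):
    """Translate the points so that their centroid is at the origin."""
    c = centroid(points)
    return [tuple(p[k] - c[k] for k in range(3)) for p in points]

# Q is P shifted by (5, 5, 5): the shapes are identical, only the position differs.
P = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
Q = [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0), (5.0, 6.0, 5.0)]

rmsd_raw = rmsd(P, Q)                       # dominated by the offset
rmsd_centered = rmsd(center(P), center(Q))  # offset removed
```

Centering both sets is the translation step of the optimal rigid transformation; the remaining rotational misfit is what the full superposition has to minimize.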

Sequence-dependent Structure Comparison (2)
ASCRKLE
|||||||
ASCRKLE
Minimize the rmsd over the residue pairs 1-1, ..., 7-7.

Sequence-dependent Structure Comparison (3) • Can be solved in O(n) time. • Useful for comparing structures of the same protein solved by different methods, in different conformations, or sampled through dynamics. • Useful for evaluating protein structure predictions.

Sequence-independent Structure Comparison Correspondence is unknown! Given two configurations of points in three-dimensional space, find the transformation T that produces the "largest" superimposition of corresponding 3-D points.

Order-Dependent vs. Order-Independent Comparison (of the residues of a protein sequence)
Alignment (order-dependent): a correspondence between elements of two sequences with the order (topology) kept; this is the typical structural alignment.
Bipartite matching (order-independent): a one-to-one matching.
FSEYTTHRGHR
:  ::::: ::
FESYTTHRPHR

FESYTTHRGHR
:::::::: ::
FESYTTHRPHR

Evaluating Structural Alignments 1. Number of amino acid correspondences created. 2. RMSD of corresponding amino acids. 3. Percent identity in aligned residues. 4. Number of gaps introduced. 5. Size of the two proteins. 6. Conservation of known active site environments. … There are no universally agreed-upon criteria; the choice depends on what the alignment is used for.
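Several of these criteria can be read straight off a gapped pairwise alignment. A minimal sketch, assuming the alignment is given as two equal-length strings with '-' marking a gap; the function name `alignment_stats` is hypothetical, and gaps are counted per column rather than per gap run.

```python
def alignment_stats(row_a, row_b, gap="-"):
    """Count aligned pairs, gap columns, and percent identity of an alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    aligned = identical = gap_columns = 0
    for x, y in zip(row_a, row_b):
        if x == gap or y == gap:
            gap_columns += 1          # one of the two residues is unmatched
        else:
            aligned += 1              # a residue-residue correspondence
            if x == y:
                identical += 1
    percent_identity = 100.0 * identical / aligned if aligned else 0.0
    return {
        "aligned": aligned,
        "gap_columns": gap_columns,
        "percent_identity": percent_identity,
    }

ungapped = alignment_stats("FSEYTTHRGHR", "FESYTTHRPHR")
gapped = alignment_stats("HSERAHVFIM", "GQ-VMAC-NW")
```

The RMSD criterion additionally needs the coordinates of the matched residues, so it is not derivable from the alignment strings alone.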

Structural Alignment Output 1ABR:B (abrin-A) vs. 1BAS:_ (basic fibroblast growth factor, bFGF): sequence identity = 10%, RMSD = 1.9 Å

Lecture Outline • Why structural alignment • Technical definition • SSAP • DALI • Fast search • Protein families

How to recognize structural similarities 1. By eye (SCOP) 2. Algorithmically • Point-based methods use properties of points (distances) to establish correspondences ◦ Dynamic programming (SSAP) ◦ Distance matrix (DALI) • Secondary structure-based methods use vectors representing secondary structures to establish correspondences (LOCK) • Image processing-based methods

Structural Comparison Algorithms • Because of the high computational complexity, practical algorithms rely on heuristics • Fully automated structure analysis has not been as successful as analysis with human intervention, which can take the biological implications into account

SSAP • SSAP: Sequential Structure Alignment Program • Uses double dynamic programming to produce a structural alignment between two proteins

Basic Ideas of SSAP The similarity between residue i in molecule A and residue k in molecule B is characterised in terms of their structural surroundings. This similarity can be quantified as a score, $S_{ik}$. Based on this similarity score and a specified gap penalty, dynamic programming is used to find the optimal structural alignment.
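Once a score matrix is available, the alignment step is ordinary global dynamic programming. The sketch below assumes the scores are given as a list of lists `S[i][k]` and that every skipped residue costs a constant `gap_penalty`; it is an illustration of the idea, not the SSAP implementation.

```python
def align_scores(S, gap_penalty):
    """Global alignment maximizing matched scores minus a constant gap cost.

    Returns the total score and the list of matched (i, k) index pairs.
    """
    n, m = len(S), len(S[0])
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = H[i - 1][0] - gap_penalty
    for k in range(1, m + 1):
        H[0][k] = H[0][k - 1] - gap_penalty
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            H[i][k] = max(
                H[i - 1][k - 1] + S[i - 1][k - 1],  # match residue i with k
                H[i - 1][k] - gap_penalty,          # skip a residue in A
                H[i][k - 1] - gap_penalty,          # skip a residue in B
            )
    # Trace back from the corner to recover the matched pairs.
    pairs = []
    i, k = n, m
    while i > 0 and k > 0:
        if H[i][k] == H[i - 1][k - 1] + S[i - 1][k - 1]:
            pairs.append((i - 1, k - 1))
            i, k = i - 1, k - 1
        elif H[i][k] == H[i - 1][k] - gap_penalty:
            i -= 1
        else:
            k -= 1
    pairs.reverse()
    return H[n][m], pairs

# Toy scores: residue 1 of A has no good partner in B and should be skipped.
S = [
    [5, 0, 0],
    [0, 0, 0],
    [0, 5, 0],
    [0, 0, 5],
]
score, pairs = align_scores(S, gap_penalty=1)
```

In this toy case the best path pays one gap penalty to skip the unmatched residue and collects all three high-scoring pairs.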

Scoring Function of SSAP (1) Distance between residues i and j in molecule A: $d^A_{i,j}$. Similarity for two pairs of residues, (i, j) in A and (k, l) in B; a, b are constants.

Scoring Function of SSAP (2) Similarity between residue i in A and residue k in B: $S_{i,k}$ is big if the distances from residue i in A to its 2n nearest neighbours are similar to the corresponding distances around k in B.
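A minimal sketch of such a score, assuming the per-pair similarity has the form $a / (|d^A_{i,i+m} - d^B_{k,k+m}| + b)$ and that $S_{i,k}$ sums it over the sequence neighbours $m = -n, \dots, n$, $m \neq 0$. This form is an assumption chosen to be consistent with the constants a, b and the 2n neighbours named above, not necessarily the exact expression used in the lecture; neighbours that fall off either chain end are skipped.

```python
import math

def residue_similarity(A, B, i, k, n=4, a=1.0, b=1.0):
    """Assumed SSAP-like score: compare distances to the 2n sequence neighbours."""
    total = 0.0
    for m in range(-n, n + 1):
        if m == 0:
            continue
        if not (0 <= i + m < len(A) and 0 <= k + m < len(B)):
            continue  # neighbour does not exist in one of the chains
        d_a = math.dist(A[i], A[i + m])
        d_b = math.dist(B[k], B[k + m])
        total += a / (abs(d_a - d_b) + b)
    return total

def similarity_matrix(A, B, n=4, a=1.0, b=1.0):
    """Score every residue of A against every residue of B."""
    return [
        [residue_similarity(A, B, i, k, n, a, b) for k in range(len(B))]
        for i in range(len(A))
    ]

# Toy input: a straight 10-residue chain and a translated copy of it.
A = [(3.8 * t, 0.0, 0.0) for t in range(10)]
B = [(x + 10.0, y - 2.0, z + 7.0) for x, y, z in A]
S = similarity_matrix(A, B)
```

With a = b = 1, each perfectly matching neighbour distance contributes 1, so an interior residue with all 2n = 8 neighbours present scores at most 8.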

Alignment Gaps in SSAP This works well for small structures and local structural alignments. However, insertions and deletions cause problems, because unrelated distances are then compared. A: HSERAHVFIM.. (i=5) B: GQ-VMAC-NW.. (k=4) The actual SSAP algorithm uses dynamic programming on two levels: first to find which distances to compare (giving $S_{ik}$), then to align the structures using these scores.

Steps in SSAP (1) 1) Calculate vectors from the Cβ of one amino acid to a set of nearby amino acids ◦ Vectors from two separate proteins are compared ◦ The difference (expressed as an angle) is calculated and converted to a score 2) A matrix of scores for the vector differences between one protein and the other is computed.

Steps in SSAP (2) 3) The optimal alignment is found using global dynamic programming with a constant gap penalty 4) The next amino acid residue is considered, and the optimal path to align this amino acid to the second sequence is computed

Steps in SSAP (3) 5) Alignments are transferred to a summary matrix ◦ If paths cross the same matrix position, their scores are summed ◦ If part of an alignment path is found in both matrices, this is evidence of similarity
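The accumulation in step 5 can be sketched as follows, assuming each lower-level alignment path is represented as a list of `(i, k, score)` triples. That representation and the function name are hypothetical choices for illustration.

```python
def accumulate_paths(shape, paths):
    """Sum the scores of all alignment paths into one summary matrix."""
    rows, cols = shape
    summary = [[0.0] * cols for _ in range(rows)]
    for path in paths:
        for i, k, score in path:
            summary[i][k] += score  # positions shared by several paths add up
    return summary

# Two toy paths that both pass through position (1, 1).
paths = [
    [(0, 0, 2.0), (1, 1, 3.0)],
    [(0, 1, 1.0), (1, 1, 2.0), (2, 2, 4.0)],
]
summary = accumulate_paths((3, 3), paths)
```

Positions supported by many lower-level paths end up with large values, which is what the final dynamic programming pass over the summary matrix then exploits.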

Steps in SSAP (4) 6) A dynamic programming alignment is performed on the summary matrix ◦ The final alignment represents the optimal alignment between the protein structures ◦ The resulting score is normalized so that it can be used to judge how closely related the two structures are

Summary of SSAP

Lecture Outline • Why structural alignment • Technical definition • SSAP • DALI • Fast search • Protein families

Distance Matrix Approach • Uses a graphical procedure similar to dot plots • Identifies residues that lie most closely together in the three-dimensional structure • Two sequences with similar structures can have their dot plots superimposed

Distance Matrix • Similar 3D structures have similar inter-residue distances
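A distance matrix is simple to compute from coordinates, and it does not change when the structure is rotated or translated. A minimal sketch, assuming one representative point per residue given as a 3-tuple; the helper names are illustrative.

```python
import math

def distance_matrix(coords):
    """All pairwise Euclidean distances between the points of one structure."""
    return [[math.dist(p, q) for q in coords] for p in coords]

def max_abs_difference(D1, D2):
    """Largest element-wise difference between two equally sized matrices."""
    return max(
        abs(x - y) for row1, row2 in zip(D1, D2) for x, y in zip(row1, row2)
    )

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
shifted = [(x + 1.0, y + 2.0, z + 3.0) for x, y, z in coords]

D = distance_matrix(coords)
D_shifted = distance_matrix(shifted)
difference = max_abs_difference(D, D_shifted)
```

Because the matrix is independent of the coordinate frame, two structures can be compared through their distance matrices without superimposing them first.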

DALI • DALI: Distance-matrix ALIgnment • Uses the distance matrix method to align protein structures • The assembly step uses Monte Carlo simulation to find submatrices that can be aligned

DALI Summary

Structural Analysis Algorithms – DALI (1) • DALI is based on distance matrices: 2D matrices containing all pairwise distances between the points of a molecule • The distance matrices of two molecules are compared to find regions with similar patterns of distances, which indicate similarities in their 3D structures • Key algorithm steps: 1. Divide the distance matrices into overlapping sub-matrices of fixed size 2. Search through the two matrices (of the two molecules) to find similar patterns 3. Assemble matching pairs of sub-matrices into larger sets to maximize their similarity score
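Steps 1 and 2 can be sketched as an exhaustive comparison of sub-matrices. This is a simplified illustration under stated assumptions: the sub-matrix size and the tolerance `tol` are free parameters here, and the mean absolute difference stands in for DALI's actual similarity score, which is not given in the slides.

```python
def submatrix(D, i, j, size):
    """The size x size block of D whose top-left corner is (i, j)."""
    return [row[j:j + size] for row in D[i:i + size]]

def all_submatrices(D, size):
    """Every overlapping size x size block of D, keyed by its corner."""
    n = len(D)
    return {
        (i, j): submatrix(D, i, j, size)
        for i in range(n - size + 1)
        for j in range(n - size + 1)
    }

def pattern_distance(X, Y):
    """Mean absolute difference between two equally sized blocks."""
    diffs = [abs(x - y) for rx, ry in zip(X, Y) for x, y in zip(rx, ry)]
    return sum(diffs) / len(diffs)

def similar_pairs(DA, DB, size, tol):
    """Corner pairs whose blocks differ by at most `tol` on average."""
    subs_a = all_submatrices(DA, size)
    subs_b = all_submatrices(DB, size)
    return [
        (pa, pb)
        for pa, X in subs_a.items()
        for pb, Y in subs_b.items()
        if pattern_distance(X, Y) <= tol
    ]

# Toy distance matrix of four equally spaced points on a line.
DA = [[abs(i - j) for j in range(4)] for i in range(4)]
DB = [row[:] for row in DA]
pairs = similar_pairs(DA, DB, size=2, tol=0.0)
```

The exhaustive search is quadratic in the number of sub-matrices, which is one reason practical implementations prune candidates before the assembly step.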

Structural Analysis Algorithms – DALI (2) • Assembly of the aligned sub-matrices is done using Monte Carlo optimization • Monte Carlo optimization is an iterative improvement by a random-walk exploration of the search space, with occasional excursions into non-optimal territory (i.e. occasionally, a move that reduces the overall score is carried out) • The occasional non-optimal moves help avoid getting "trapped" in local optima of the score function, improving the chance of finding the global optimum
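The random-walk idea can be illustrated on a toy one-dimensional score function. The sketch below assumes the Metropolis acceptance rule (accept a worsening move with probability exp(Δ/temperature)), which is a common choice but is not specified in the slides; the score function, the neighbour move, and all parameter values are made up for illustration.

```python
import math
import random

def monte_carlo_maximize(score, neighbor, start, steps=2000,
                         temperature=1.0, seed=0):
    """Random-walk maximization that sometimes accepts worse moves."""
    rng = random.Random(seed)
    current, current_score = start, score(start)
    best, best_score = current, current_score
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept a worse move with
        # probability exp(delta / temperature).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score
    return best, best_score

def toy_score(x):
    """Local optimum at x = 5 (score 3), global optimum at x = 20 (score 10)."""
    return max(3 - abs(x - 5), 10 - 2 * abs(x - 20))

def step(x, rng):
    """Move one position left or right, staying inside 0..30."""
    return min(30, max(0, x + rng.choice((-1, 1))))

best, best_score = monte_carlo_maximize(toy_score, step, start=5,
                                        temperature=2.0)
```

Starting at the local optimum, a purely greedy walk could never leave x = 5; the occasional downhill moves are what allow the walk to cross the valley toward the higher peak.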

DALI Steps (1)

DALI Steps (2)

DALI Steps (3)

Lecture Outline • Why structural alignment • Technical definition • SSAP • DALI • Fast search • Protein families

Fast Structural Similarity Search • Compare the types and arrangements of secondary structures within two proteins • If the elements are similarly arranged, the three-dimensional structures are similar • LOCK, VAST and SARF are programs that use these fast methods

Align Structures by Secondary Structures

Structural Analysis Algorithms – LOCK • Both SSAP and DALI deal only with points (atoms) of the molecules • LOCK uses a hierarchical approach ◦ Larger secondary structures such as helices and strands are represented using vectors and dealt with first ◦ Individual residues are dealt with afterwards ◦ Assumes that large secondary structures provide most of the stability and function of a protein, and are most likely to be preserved during evolution

LOCK Algorithm • Key algorithm steps: 1. Represent secondary structures as vectors 2. Obtain an initial superposition by computing a local alignment of the secondary structure vectors (using dynamic programming) 3. Compute the residue superposition by performing a greedy search that tries to minimize the root mean square deviation (an RMS distance measure) between pairs of nearest backbone atoms from the two proteins 4. Identify "core" (well-aligned) atoms and try to improve their superposition (possibly at the cost of degrading the superposition of non-core atoms) • Steps 2, 3, and 4 require iteration at each step
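Step 1 can be sketched with a crude vector representation. The code below assumes each secondary-structure element is summarized by the vector from its first to its last residue coordinate, which is a simplification chosen for illustration and not necessarily how LOCK derives its vectors; the angle between two such vectors is one simple way to compare element orientations.

```python
import math

def sse_vector(coords, start, end):
    """Vector from residue `start` to residue `end` of one element."""
    return tuple(coords[end][k] - coords[start][k] for k in range(3))

def angle_between(u, v):
    """Angle in degrees between two non-zero 3-D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.hypot(*u) * math.hypot(*v)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cosine))

# Toy chain: residues 0-3 run along x, residues 4-7 run along y.
coords = [(float(t), 0.0, 0.0) for t in range(4)]
coords += [(3.0, float(t), 0.0) for t in range(1, 5)]

first = sse_vector(coords, 0, 3)    # element along the x-axis
second = sse_vector(coords, 4, 7)   # element along the y-axis
angle = angle_between(first, second)
```

Comparing a handful of vectors per protein is far cheaper than comparing all residue pairs, which is what makes the secondary-structure level a useful first pass.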

ProteinDBS Shyu, Chi, Scott, Xu. Nucleic Acids Research 32, W572-W575, 2004

Comparison between different methods • CATH ◦ Fully automated ◦ SSAP • SCOP ◦ Based on subjective interpretation of the evolutionary history of proteins • FSSP ◦ DALI • Agreement between CATH and SCOP may be at most 60%. ◦ FSSP vs. CATH: 40% ◦ FSSP vs. SCOP: 60%

Lecture Outline • Why structural alignment • Technical definition • SSAP • DALI • Fast search • Protein families

Structure Families (1) Homologous family: evolutionarily related with a significant sequence identity; Superfamily: different families whose structural and functional features suggest common evolutionary origin; Fold: different superfamilies having same major secondary structures in same arrangement and with same topological connections (energetics favoring certain packing arrangements); Class: secondary structure composition.

6 Classes of Protein Structures (1) 1) Class α: bundles of α helices connected by loops on the surface of proteins 2) Class β: antiparallel β sheets, usually two sheets in close contact forming a sandwich 3) Class α/β: mainly parallel β sheets with intervening α helices; may also have mixed β sheets (metabolic enzymes)

6 Classes of Protein Structures (2) 4) Class α+β: mainly segregated α helices and antiparallel β sheets 5) Multi-domain (α and β) proteins: more than one of the above four domains 6) Membrane and cell-surface proteins and peptides, excluding proteins of the immune system

Structure of α class proteins

Structure of β class proteins

Structure of α/β class proteins

Structure of α+β class proteins

20 most frequent common domains (folds)

Reading Assignments • Suggested reading: ◦ Contemporary approaches to protein structure classification. Mark B. Swindells, et al. BioEssays, Volume 20, Issue 11, 1998, Pages: • Optional reading: ◦ The structural alignment between two proteins: Is there a unique answer? Adam Godzik, Protein Science (1996), ◦ Protein Structure Similarities. Patrice Koehl, Current Opinion in Structural Biology (2001),

Project Assignment Develop a program that can perform protein structural alignment using SSAP: 1. The Cα coordinates of two proteins (A and B) will be sent to the mailing list. 2. Calculate the similarity matrix between residue i in A and residue k in B (let n = 4, a = b = 1). 3. Perform dynamic programming on $S_{i,k}$, then retrieve the alignment and print it out.

Project Phase III Report • Due on 11/17, send me through • Write on top of the Phase II report. • 7-30 pages • Serves as a draft of the final report • Free style in writing (use 11pt font or larger) • Present key results ◦ Software implementation ◦ Benchmark (computing time) ◦ Computational data ◦ Interpret the meaning of the data