EMBL-EBI MSDfold (SSM) A web service for protein structure comparison and structure searches Eugene Krissinel


Structure alignment
Structure alignment may be defined as the identification of residues occupying “equivalent” geometrical positions.
- Unlike in sequence alignment, residue type is neglected
- Used for:
  - measuring structural similarity
  - protein classification and functional analysis
  - database searches

Methods
Many methods are known:
- Distance matrix alignment (DALI, Holm & Sander, EBI)
- Vector alignment (VAST, Bryant et al., NCBI)
- Depth-first recursive search on SSEs (DEJAVU, Madsen & Kleywegt, Uppsala)
- Combinatorial extension (CE, Shindyalov & Bourne, SDSC)
- Dynamic programming on Cα (Gerstein & Levitt)
- Dynamic programming on SSEs (SSA, Singh & Brutlag, Stanford University)
- many others
SSM employs a 2-step procedure:
A. Initial structure alignment and superposition using SSE graph matching
B. Cα-alignment

Graph representation of SSEs
[Figure: SSEs represented as a graph with labelled vertices and edges; E. M. Mitchell et al. (1990) J. Mol. Biol. 212:151]
SSE graphs differ from conventional chemical graphs only in that they are labelled by vectors of properties. In graph matching, the labels are compared with tolerances chosen empirically.
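The idea of comparing vector labels within tolerances can be sketched as follows. This is only an illustration: the label fields (`length`, `distance`, `angle`) and all tolerance values are hypothetical and are not taken from the slides or from SSM itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    """One SSE. The label fields are hypothetical."""
    kind: str      # "H" for helix, "S" for strand
    length: float  # SSE length in angstroms (assumed label)


@dataclass(frozen=True)
class Edge:
    """Relation between two SSEs. The label fields are hypothetical."""
    distance: float  # distance between SSE midpoints, angstroms
    angle: float     # angle between SSE axes, degrees


def vertices_match(v, w, rel_tol=0.25):
    """Vertices match if they have the same type and similar length."""
    same_kind = v.kind == w.kind
    close = abs(v.length - w.length) <= rel_tol * max(v.length, w.length)
    return same_kind and close


def edges_match(e, f, dist_tol=2.0, angle_tol=30.0):
    """Edges match if every label component agrees within its tolerance."""
    return (abs(e.distance - f.distance) <= dist_tol
            and abs(e.angle - f.angle) <= angle_tol)


# A helix of 15 A matches a helix of 17 A, but never a strand.
print(vertices_match(Vertex("H", 15.0), Vertex("H", 17.0)))  # True
print(vertices_match(Vertex("H", 15.0), Vertex("S", 15.0)))  # False
```

A graph-matching routine would call predicates like these in place of the exact label equality used for chemical graphs.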

SSE graph matching
[Figure: SSE graphs of structures A (H1, H2, S1-S4) and B (H1-H6, S1-S7) and their matching]
Matching the SSE graphs yields a correspondence between secondary structure elements, that is, groups of residues. The correspondence may be used as an initial guess for structure superposition and alignment of individual residues.

Cα-alignment
[Figure: chain A and chain B with matched helices and matched strands]
- SSE-alignment is used as an initial guess for Cα-alignment
- Cα-alignment is an iterative procedure based on the expansion of shortest contacts at best superposition of structures
- Cα-alignment is a compromise between the alignment length N_align and r.m.s.d. Longest contacts are unmapped in order to maximise the Q-score.
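The Q-score formula itself did not survive in this transcript. The sketch below uses the form published in Krissinel & Henrick (2004), Q = N_align^2 / ((1 + (RMSD/R0)^2) * N1 * N2) with R0 = 3 Å, where N1 and N2 are the numbers of residues in the two structures. It reproduces the Q-score of 0.213 quoted later in the talk for d1di2a_ against 1KNO:A. The function and parameter names are illustrative.

```python
def q_score(n_align, rmsd, n1, n2, r0=3.0):
    """Q-score of a C-alpha alignment.

    n_align: number of aligned residue pairs
    rmsd:    r.m.s.d. of the aligned C-alpha atoms, angstroms
    n1, n2:  number of residues in each structure
    r0:      empirical scale, 3 A in Krissinel & Henrick (2004)
    """
    return n_align ** 2 / ((1.0 + (rmsd / r0) ** 2) * n1 * n2)


# Identical structures: full-length alignment at zero r.m.s.d. gives Q = 1.
print(q_score(100, 0.0, 100, 100))

# d1di2a_ (69 residues) vs 1KNO:A (184 residues), values from the talk.
print(round(q_score(67, 2.43, 69, 184), 3))
```

Q rises with alignment length and falls with r.m.s.d., which is why unmapping the longest contacts can raise it: dropping a few badly fitting pairs may lower the r.m.s.d. term by more than it costs in N_align.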

Multiple structure alignment
- More than 2 structures are aligned simultaneously
- Multiple alignment is not equal to the set of all-to-all pairwise alignments
- Helps to identify common structure motifs for a whole family of structures

Iterative removal of non-aligning SSEs
[Figure: best pairwise alignments between structures A, B and C]
- Helices may be multiply aligned from pairwise relations
- Strands do not multiply align, but one can still try to align them by probing alternative (not best) alignments

Iterative removal of non-aligning SSEs
[Figure: structures A, B and C with alternative pairwise alignments]
4 alternative pairwise alignments make up to 4 multiple alignments:
A1 - B1 - C1
A1 - B2 - C1
A2 - B1 - C1
A2 - B2 - C1
Complexity prohibitive for structures
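To see why probing alternatives becomes prohibitive, the enumeration on this slide can be reproduced as a Cartesian product. The reading of A1, A2, B1, B2 and C1 as per-structure alternatives is an assumption based on the labels alone.

```python
from itertools import product

# Alternatives per structure, as labelled on the slide (interpretation assumed).
alternatives = {"A": ["A1", "A2"], "B": ["B1", "B2"], "C": ["C1"]}

# Every candidate multiple alignment picks one alternative per structure.
candidates = [" - ".join(choice) for choice in product(*alternatives.values())]
for candidate in candidates:
    print(candidate)


def n_candidates(n_structures, k_alternatives):
    """Number of combinations if each of n structures offers k alternatives."""
    return k_alternatives ** n_structures


print(n_candidates(10, 3))  # grows exponentially with the number of structures
```

With k alternatives for each of n structures the number of candidates is k**n, so an exhaustive search is only feasible for very small families.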

Iterative removal of non-aligning SSEs
Heuristics: remove the non-aligning SSE with the lowest alignment score and reiterate all alignments.
[Figure: structures A, B and C]
Flowchart:
1. Start
2. Calculate all-to-all pairwise alignments
3. Are there non-aligning SSEs?
   - Yes: remove one non-aligning SSE with the lowest score and return to step 2
   - No: quit
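The loop in this flowchart can be sketched as below. This is a toy under stated assumptions, not SSM's implementation: `pairwise_align` is a hypothetical callback, an SSE is represented as a `(type, length)` tuple, and an SSE counts as "non-aligning" when at least one pairwise alignment leaves it unmatched.

```python
from itertools import combinations


def iterative_removal(structures, pairwise_align):
    """structures: dict name -> list of SSEs.
    pairwise_align(sses_a, sses_b) -> {sse_a: (sse_b, score)} (hypothetical API).
    Returns the SSEs that survive in every structure."""
    structures = {name: list(sses) for name, sses in structures.items()}
    while True:
        # Step 2: all-to-all pairwise alignments, collecting scores per SSE.
        scores = {n: {s: [] for s in sses} for n, sses in structures.items()}
        for a, b in combinations(structures, 2):
            matches = pairwise_align(structures[a], structures[b])
            for sse_a, (sse_b, score) in matches.items():
                scores[a][sse_a].append(score)
                scores[b][sse_b].append(score)
        # Step 3: an SSE is non-aligning if some partner left it unmatched.
        partners = len(structures) - 1
        non_aligning = [(sum(sc), name, sse)
                        for name, per_sse in scores.items()
                        for sse, sc in per_sse.items() if len(sc) < partners]
        if not non_aligning:
            return structures
        # Remove only the lowest-scoring one, then realign everything.
        _, name, sse = min(non_aligning)
        structures[name].remove(sse)


def toy_align(sses_a, sses_b):
    """Toy aligner: identical (type, length) SSEs match, score = length."""
    return {s: (s, s[1]) for s in sses_a if s in sses_b}


family = {
    "A": [("H", 10), ("S", 5), ("S", 7)],
    "B": [("H", 10), ("S", 5)],
    "C": [("H", 10), ("S", 5), ("H", 20)],
}
result = iterative_removal(family, toy_align)
print(result)
```

Removing one SSE per pass and then recomputing all alignments matters because deleting an element can change which alternative alignment is best for the remaining ones.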

Multiple Cα refinement
Central star & consensus
[Figure: structures A, B and C with consensus structure X]
Flowchart:
1. Multiple SSE alignment
2. Initial Cα alignment
3. Superpose structures and calculate consensus structure X
4. Choose the structure closest to X as the central star and align all the rest to it
5. Unmap groups of atoms with the highest distance score D in order to maximise the score
6. Score improved?
   - Yes: reiterate
   - No: quit
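The consensus and central-star steps can be illustrated with a few lines of code. The sketch assumes that the structures are already superposed and that the lists hold equivalent (aligned) Cα atoms in the same order. It takes the consensus to be the plain coordinate average and measures closeness by r.m.s.d.; both are assumptions, since the slide does not define them.

```python
import math


def consensus(coords):
    """Average position of each aligned atom over all structures.
    coords: dict name -> list of (x, y, z), already superposed."""
    members = list(coords.values())
    n_atoms = len(members[0])
    return [tuple(sum(m[i][k] for m in members) / len(members) for k in range(3))
            for i in range(n_atoms)]


def rmsd(a, b):
    """Root-mean-square distance between two equally long atom lists."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))


def central_star(coords):
    """Name of the structure closest to the consensus X."""
    x = consensus(coords)
    return min(coords, key=lambda name: rmsd(coords[name], x))


toy = {
    "A": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "B": [(0.0, 0.3, 0.0), (1.0, 0.3, 0.0)],
    "C": [(0.0, -0.9, 0.0), (1.0, -0.9, 0.0)],
}
print(central_star(toy))  # A lies closest to the average of the three
```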

Pairwise Alignment vs. Multiple Alignment
- Best pairwise alignment of 1SAR:A and 1D1F:B includes only the β-sheet
- Addition of 1MGW:A (a close neighbour of 1SAR:A) reveals a common motif of β-sheet and α-helix

SSM server map
[Figure: SSM server map]

SSM output
- Table of matched Secondary Structure Elements
- Table of matched backbone Cα-atoms with distances between them at best structure superposition
- Rotation-translation matrix of best structure superposition
- Visualisation in Jmol and Rasmol
- r.m.s.d. of Cα-alignment
- Length of Cα-alignment N_align
- Number of gaps in Cα-alignment
- Quality score Q
- Statistical significance scores P(S), Z
- Sequence identity

Statistical significance of alignments
- The P-value is estimated using Q-scores of SSE deviations
- P(S) is the probability of getting a score equal to S or higher when picking structures at random from the PDB
- P(S) is calibrated on SCOP folds
- P(S) is often expressed through the Z-score
[Figure: labels x1, xi, xn]
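The slide does not say how P(S) is converted into a Z-score. A common convention, assumed here and possibly different from SSM's own definition, is to report the point of the standard normal distribution whose upper tail has probability P.

```python
from statistics import NormalDist


def z_from_p(p):
    """Z such that a standard normal variable exceeds Z with probability p.
    One-sided Gaussian convention; this is an assumption, not SSM's formula."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return NormalDist().inv_cdf(1.0 - p)


print(round(z_from_p(0.025), 2))  # the familiar 1.96
print(round(z_from_p(0.5), 2))    # a score no better than the median
```

Under this convention a smaller P gives a larger Z, so hits can be ranked by either quantity.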

Scoring at low structural similarity - 1KNO:A vs SCOP 1.61

Criterion        Domain   Residues  Q-score  RMSD  N_align  P
Maximal Q-score  d1di2a_  69        0.213    2.43  67/184   0.55
Lowest RMSD      d1emn_1  43        0.019    0.9   13/184   0.075
Highest N_align  d1elxb_  449       0.02     5.82  89/184   ~1

Performance data
[Figure: performance data]

Sequence and Structure Alignments

Sequence alignment
Based on residue identity, sometimes with a modified alphabet
--AARNEDDDGKMPSTF-L
E-AARNFG-DGK--STFIL
Used for:
- evolution studies
- protein function analysis
- guessing at structural similarity
Algorithms: Dynamic programming + heuristics
Applications: BLAST, FASTA, FLASH and others

Structure alignment
Based on geometrical equivalence of residue positions, residue type disregarded
Used for:
- protein function analysis
- some aspects of evolution studies
Algorithms: Dynamic programming, graph theory, MC, geometric hashing and others
Applications: DALI, VAST, CE, MASS, SSM and others

Sequence and Structure Identity
[Figure from E. Krissinel & K. Henrick (2004), Acta Cryst. D60; annotation: "Good structure similarity"]
[...]% of identical residues are very often sufficient for chains to be structurally similar

Sequence identity within structure families
Given that A and B are identical at 20%, and B and C at 20%, are A and C identical at 20% or more?
[Figure: triangle A, B, C with sides labelled 20%, 20% and "?"; a caption beginning "Naively," is incomplete in this transcript]
OK, 20% sequence identity is not a necessary condition for structural similarity. How distant may the sequences within a structure family be?

Sequence identity within structure families: case A
[Figure: structures A, B and C with residues HIS, CYS, TRP]
Aligned residues are structurally conserved through the family. This is a typical assumption for multiple sequence alignment.
Implications:
- Protein folds are controlled by certain residue types and/or subsequences
- Protein structure, and therefore function, is clearly sequence-related

Sequence identity within structure families: case B
[Figure: structures A, B and C with residues HIS, CYS, TRP]
Aligned residues are not conserved through the family.
Implications:
- Protein folds are not controlled by any particular residue types and/or subsequences
- Many different sequences may fold into similar structures
- Protein structure, and therefore function, is not clearly sequence-related

Sequence identity within structure families: case B
[Figure: structures A, B and C with residues HIS, CYS, TRP]
This case may be identified by multiple structure alignment only. Multiple sequence alignment will always find and superpose short fragments:
-----AFRNEDDDGGKPSTFKL
EAARNAF GKKSTFIL
EAARNAFDGKMTBIGK

Multiple alignment of SCOP folds
SCOP database:
- 11 classes
- 945 folds
- 1539 superfamilies
- 2845 families
- domains
SCOP:
- Structure-related hierarchy
- Manually curated
Multiple structure alignment of domains in SCOP folds:
- Sound structure resemblance within folds
- Wide sequence variations
- Sequence redundancy cut-off at 50%

Sequence identity in SCOP folds
- Average multiple sequence identity (A): 12%
- Average pairwise sequence identity (B): 19%
[Figure: pairwise sequence conservation (case B) and multiple sequence conservation (case A)]
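The difference between the two averages can be made concrete with a toy alignment. In this sketch, pairwise identity is the fraction of ungapped column pairs with the same residue, and multiple identity is the fraction of ungapped columns in which all sequences agree. These definitions are assumptions, since the slides do not spell them out, and the sequences are invented.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Fraction of aligned (ungapped) positions with identical residues."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x == y for x, y in pairs) / len(pairs)


def average_pairwise_identity(seqs):
    """Mean of pairwise identity over all pairs of sequences."""
    values = [pairwise_identity(a, b) for a, b in combinations(seqs, 2)]
    return sum(values) / len(values)


def multiple_identity(seqs):
    """Fraction of ungapped columns where every sequence has the same residue."""
    columns = [c for c in zip(*seqs) if "-" not in c]
    return sum(len(set(c)) == 1 for c in columns) / len(columns)


toy = ["ACDE", "ACFG", "HCFE"]  # invented sequences, already aligned
print(average_pairwise_identity(toy))  # 0.5
print(multiple_identity(toy))          # 0.25
```

Every pair agrees on half of the positions, yet only one column is conserved across all three sequences. Multiple identity can never exceed the lowest pairwise identity, which is consistent with the ordering of the two averages quoted on the slide.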

Residue conservation
Odds are calculated as a ratio of observed and expected probabilities to obtain identity residue substitutions.
Henikoff, S. and Henikoff, J. G. (1992) Proc. Natl. Acad. Sci. 89, p.
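The slide's formula was not captured. A minimal sketch in the spirit of Henikoff & Henikoff (1992) is given below: the observed probability of an identity pair, q_ii, is divided by the probability p_i^2 expected from the residue's background frequency. The base-2 logarithm, the function name and the toy counts are assumptions.

```python
import math


def identity_log_odds(pair_counts):
    """log2 of observed / expected probability for identity substitutions.

    pair_counts: dict mapping an unordered residue pair (a, b) to the number
    of times it occurs in aligned columns; each pair is stored once."""
    total = sum(pair_counts.values())
    # Background frequency p_i = q_ii + sum over j != i of q_ij / 2.
    p = {}
    for (a, b), n in pair_counts.items():
        p[a] = p.get(a, 0.0) + n / (2 * total)
        p[b] = p.get(b, 0.0) + n / (2 * total)
    result = {}
    for residue, freq in p.items():
        observed = pair_counts.get((residue, residue), 0) / total
        if observed > 0:
            result[residue] = math.log2(observed / freq ** 2)
    return result


toy_counts = {("A", "A"): 6, ("A", "G"): 2, ("G", "G"): 2}  # invented counts
for residue, value in identity_log_odds(toy_counts).items():
    print(residue, round(value, 3))
```

A positive value means the residue is found aligned with itself more often than chance would predict, which is the sense in which it is "conserved".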

Residue conservation
[Figure: residue conservation] Reference data from Naor D. et al. (1996). J. Mol. Biol. 256, p.

Log odds matrix for SCOP folds
[Figure: log odds matrix] Hydropathy index by Kyte, J. and Doolittle, R. F. (1982). J. Mol. Biol. 157, p. 105.

Sequence vs “hydropathy” identity in SCOP folds
- Average pairwise sequence identity: 19%
- Average multiple sequence identity: 12%
- Average “hydropathy” identity: 68%
[Figure: hydropathy conservation, pairwise sequence conservation (case B) and multiple sequence conservation (case A)]
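One way to compute a "hydropathy" identity is sketched below. It assumes that two residues count as hydropathy-identical when their Kyte-Doolittle indices have the same sign; the talk does not state the exact criterion, so this classification is an assumption. The example pair is the sequence alignment shown earlier in the talk.

```python
# Kyte & Doolittle (1982) hydropathy index.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def is_hydrophobic(residue):
    """Assumed classification: positive index means hydrophobic."""
    return KD[residue] > 0


def identities(a, b):
    """Return (sequence identity, hydropathy identity) over ungapped columns."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    same_residue = sum(x == y for x, y in pairs)
    same_class = sum(is_hydrophobic(x) == is_hydrophobic(y) for x, y in pairs)
    return same_residue / len(pairs), same_class / len(pairs)


seq_id, hyd_id = identities("--AARNEDDDGKMPSTF-L", "E-AARNFG-DGK--STFIL")
print(f"sequence identity:   {seq_id:.2f}")
print(f"hydropathy identity: {hyd_id:.2f}")
```

Hydropathy identity is never lower than sequence identity, because identical residues always fall in the same class.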

What is 20% sequence identity?
Consider an idealized model, where all residues are indiscriminately substituted by like-hydropathic residues only:
- 10 hydrophilic residues
- 10 hydrophobic residues
[Figure: count matrix; total counts (in upper triangle); expected sequence identity]
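The count matrix and the resulting number are missing from this transcript. One possible reading of the model, offered only as an assumption, is that every unordered pair of residues within the same hydropathy class (the upper triangle of the count matrix, diagonal included) is equally likely. Under that reading the expected identity is the share of diagonal entries.

```python
from itertools import combinations_with_replacement


def expected_identity(class_sizes):
    """Share of identity pairs among all equally weighted within-class pairs.

    class_sizes: number of residue types in each hydropathy class.
    Counts unordered pairs, i.e. the upper triangle including the diagonal."""
    identical = 0
    total = 0
    for size in class_sizes:
        for a, b in combinations_with_replacement(range(size), 2):
            total += 1
            identical += (a == b)
    return identical / total


value = expected_identity([10, 10])
print(f"{value:.1%}")  # 20 identity pairs out of 110 within-class pairs
```

Under this assumption, purely hydropathy-preserving substitution gives an identity close to the 20% in the slide title. If ordered substitutions were counted instead, the figure would be 10%, so the result depends on how the counts are defined.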

Conclusion
- It is quite possible that residue identity plays a much less significant role in protein structure than often believed
- As a consequence, the role of residue identity in protein function may often be overestimated
- Using sequence identity for the assessment of structural or functional features may give more false negatives than expected
- Physical-chemical properties of residues should be given preference over residue identity in structure and function analysis
- Modern methods for structure alignment are efficient; there is little sense in using sequence alignment in structure-related studies

Acknowledgement. This work has been supported by research grant No. 721/B19544 from the Biotechnology and Biological Sciences Research Council (BBSRC) UK.