The Domain Structure of Proteins: Prediction and Organization. Golan Yona Dept. of Computer Science Cornell University (joint work with Niranjan Nagarajan)

Presentation transcript:

The Domain Structure of Proteins: Prediction and Organization
Golan Yona, Dept. of Computer Science, Cornell University (joint work with Niranjan Nagarajan)

PDB: 1a8y (367 aa long)
MKIIRIETSRIAVPLTKPFKTALRTVYTAESVIVRITYDSGAVGWGEAPPTLVITGDSM…………

The domain structure of a protein
A domain is considered the fundamental unit of protein structure, folding, function, evolution and design.
- Compact
- Stable
- Folds independently?
- Has a specific function

A protein is a combination of domains
[Figure: Protein1, Protein2 and Protein3 drawn as combinations of domains]

Any signals that might indicate domain boundaries?
- A very weak signal, if any, in the sequence
- Usually domain delineation is done based on structure
- Best methods available: manual!
- But structural information is sparse

Definitions and assumptions
- Domain: a continuous sequence that corresponds to an elemental building block of protein folds.
- A subsequence that is likely to be stable as an independent folding unit.
- Was formed as an independent unit and was later combined with others to carry out more complex functions.
- There are traces of the autonomous units.

First step: gather data with a database search
- The histogram of matches is informative but noisy
- Mutations, insertions, deletions, conflicting evidence
[Figure: histogram of matches along the sequence]

Previous methods
- Methods based on the use of similarity searches and knowledge of sequence termini to delineate domain boundaries using heuristics/rules (MKDOM, Domainer, DIVCLUS, DOMO).
- Methods that rely on expert knowledge of protein families to construct models like HMMs to identify other members of the family (Pfam, TigrFam, SMART).
- Methods that try to infer domain boundaries by using sequence information to predict tertiary structure first (SnapDragon, Rigden’s covariance analysis).
- Methods that use multiple alignments to predict domain boundaries (PASS, Domination).
- Others (e.g. CSA and DGS = guess based on size).

How do you evaluate the different methods?
- No universal measures
- A variety of qualitative and quantitative evaluation criteria, external resources and manual analysis are used to verify domain boundaries

Method outline
- Source/test data: SCOP
- Processed data: alignments
- Learning system:
  - Domain-information-content scores
  - NN
  - Probabilistic model
- Evaluation
“A Multi-Expert System for the Automatic Detection of Protein Domains from Sequence Information”, Niranjan Nagarajan and Golan Yona, in the proceedings of RECOMB 2003

Overview
[Flowchart: seed sequence -> BLAST search -> multiple alignment -> measures (correlation, entropy, sequence participation, contact profile, secondary structure, physico-chemical properties, plus intron boundaries from DNA data) -> neural network -> final predictions]

The source/test data set
- PDB structures with their partitions into domains as defined in SCOP
  - e.g. 1ctf: domain, domain (residue ranges not preserved in this transcript)
- Remove sequences shorter than 40 aa and almost identical entries

Alignments
- Search each query against a database of ~1 million non-redundant sequences
- Remove fragments first
- Two-phase alignment procedure
  - First phase: BLAST
  - Second phase: multiple-iteration PSI-BLAST
- Select one representative from each group of similar proteins
- Remove proteins that are less than 90% covered (missing information)
- Number of domains ranging from 1 to 7
- Final set: 605 multi-domain proteins and 576 single-domain proteins (1/4)

The domain-information-content of an alignment column
- Measures that are believed to reflect structural properties of proteins
- A total of 20 measures
  - Conservation measures
  - Consistency and correlation measures
  - Measures of structural flexibility
  - Residue type based measures
  - Predicted secondary structure information
  - Intron-exon data

Conservation measures
- Entropy: some positions are more conserved than others
- Class entropy: some positions have a preference towards a class of amino acids (similar physico-chemical properties)
- Evolutionary pressure (span): sum of pairwise similarities. Motivation: consider the mutual similarity of amino acids
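The entropy and class-entropy measures can be illustrated with a minimal sketch. It assumes the plain Shannon entropy of the residue frequencies in a column, with gaps ignored; the talk may weight sequences or handle gaps differently. The function name `column_entropy` and the residue grouping `TOY_CLASSES` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def column_entropy(column, class_of=None, gap="-"):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters.

    If class_of is given (a dict from residue to class label), residues are
    first mapped to their class, which yields a class entropy instead.
    """
    symbols = [r for r in column if r != gap]
    if class_of is not None:
        symbols = [class_of.get(r, r) for r in symbols]
    if not symbols:
        return 0.0  # an all-gap column carries no residue information
    total = len(symbols)
    counts = Counter(symbols)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical two-class grouping, for illustration only.
TOY_CLASSES = {"I": "hydrophobic", "L": "hydrophobic", "V": "hydrophobic",
               "D": "charged", "E": "charged", "K": "charged"}

print(column_entropy("IIII"))               # fully conserved column
print(column_entropy("ILIL"))               # two residues, equally frequent
print(column_entropy("ILIL", TOY_CLASSES))  # one class throughout
```

A column that mixes isoleucine and leucine has non-zero residue entropy but zero class entropy under this grouping, which is the distinction the second measure is meant to capture.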

Consistency and correlation measures
- Every appearance of a domain should maintain its integrity
- Consistency: difference in sequence counts
- Asymmetric correlation: consistency of individual sequences
- Symmetric correlation: reinforcement by missing sequences
- Measures are averaged over a window

Consistency and correlation measures (cont.)
- Sequence termination: strong but elusive
  - Fragments
  - Premature halt in alignment
  - Loosely aligned
- Product of left and right termination scores: given c sequences that terminate at a position, with E-values e_1, e_2, e_3, …, e_c


Measures of structural flexibility
- Indel entropy: variability indicates structural flexibility (likely to occur near domain boundaries)
- Correlated mutations: indicative of contacts
- Contact profiles

Contact profile

Residue type based measures
- Hydrophobic vs. hydrophilic
- Cysteines and prolines
- Classes of amino acids
Predicted secondary structures
- Helices and strands are rigid
- Loops are more abundant near domain boundaries

Intron-exon data
- Exon boundaries are expected to coincide with domain boundaries
[Figure: Protein1, Protein2, Protein3]

Score refinement and normalization
- Smoothing using a window w (optimized)
- Unification to a single scale: z-score over all positions
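A minimal sketch of the two refinement steps follows. It assumes that smoothing is a simple moving average in which `w` is the number of positions taken on each side (the slide does not define the window precisely), and that the z-score uses the population standard deviation over all positions of a protein. The function names are hypothetical.

```python
from statistics import mean, pstdev


def smooth(scores, w):
    """Moving average; each position is averaged with up to w neighbours per side.

    The window is truncated at the sequence ends rather than padded.
    """
    n = len(scores)
    return [mean(scores[max(0, i - w):min(n, i + w + 1)]) for i in range(n)]


def zscores(scores):
    """Rescale scores to zero mean and unit (population) standard deviation."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # a constant profile has no signal
    return [(s - mu) / sigma for s in scores]


raw = [0.0, 3.0, 0.0, 0.0, 6.0, 0.0]
print(zscores(smooth(raw, w=1)))
```

Putting every measure on the same z-score scale is what allows the different measures to be compared and fed jointly to a learning system.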

Maximizing the information content of scores
- Opt for the most distinct distributions of domain positions vs. boundary positions
- Affected by the parameters w (smoothing factor) and x (boundary window size)
- Use the Jensen-Shannon divergence measure
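The Jensen-Shannon divergence between the score distribution at domain positions and the one at boundary positions can be sketched as follows. This assumes the common equal-weight form computed on two histograms over the same bins; the talk may weight the two distributions by prior probabilities instead. The function names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence between two discrete distributions.

    p and q must be probability vectors over the same bins. The result lies
    in [0, 1] when logarithms are taken base 2.
    """
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


domain_hist = [0.7, 0.2, 0.1]    # toy histogram of a score at domain positions
boundary_hist = [0.1, 0.3, 0.6]  # toy histogram of the same score at boundaries
print(js_divergence(domain_hist, boundary_hist))
```

Under this reading, w and x would be chosen to maximize the divergence, since a larger value means the measure separates domain positions from boundary positions more clearly.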

Examples

- Even measures with identical distributions may be informative in a multivariate model
- To simplify the model, only the top 12 measures are selected

The learning system
- A neural network is trained to model effectively the complex decision boundary surface
- Correctly predicts 94% of domain positions and 88% of the transitions in the test set
- Also tried mapping from multiple positions (local input neighborhood) to single/multiple output

Overview (the same flowchart as shown earlier, from seed sequence to final predictions)

Hypothesis evaluation
Simple model: refine predictions
- A significant fraction of the positions in a window centered at x should be predicted as transitions
- Order transitions by their quality (depth of the minima) and reject all transitions that are within 30 residues of already predicted transitions
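The ordering-and-rejection rule can be sketched as a greedy filter. The sketch assumes each candidate transition comes with a single positive `depth` value (larger means a deeper minimum) and treats "within 30 residues" as inclusive; both are assumptions, and the function name is hypothetical.

```python
def select_transitions(candidates, min_separation=30):
    """Greedy filtering of candidate transition points.

    candidates: list of (position, depth) pairs, where a larger depth means a
    deeper minimum in the network output, i.e. a better candidate.
    A candidate is rejected if it lies within min_separation residues
    (inclusive) of a transition that has already been accepted.
    """
    accepted = []
    for position, _depth in sorted(candidates, key=lambda c: c[1], reverse=True):
        if all(abs(position - a) > min_separation for a in accepted):
            accepted.append(position)
    return sorted(accepted)


print(select_transitions([(100, 0.9), (110, 0.8), (200, 0.5)]))
```

Because the best candidate is always accepted first, a weaker minimum can never displace a stronger one nearby.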

The domain generator model
- Multiple hypotheses: find the “best one”
- Assume a model: a random generator that moves repeatedly between a domain state and a linker state and emits one domain or transition at a time according to different source probability distributions
- Total probability is the product

Formally
[Figure: S drawn as consecutive segments D_1, D_2, …, D_n]
- We are given a sequence S (multiple alignment) of length L and a possible partition into n domains D = D_1, D_2, …, D_n of lengths l_1, l_2, …, l_n (NN output)
- Find the partition that will maximize the posterior probability P(D|S)
- Maximize the product of the likelihood and the prior
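The step from "maximize the posterior" to "maximize the product of the likelihood and the prior" is Bayes' rule: the evidence P(S) does not depend on the partition, so it can be dropped from the maximization. Written out, with \hat{D} denoting the selected partition (a symbol introduced here for illustration):

```latex
\hat{D}
  = \arg\max_{D} P(D \mid S)
  = \arg\max_{D} \frac{P(S \mid D)\, P(D)}{P(S)}
  = \arg\max_{D} P(S \mid D)\, P(D)
```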

Calculating the prior P(D)
- For an arbitrary protein of length L, what is the probability of observing D?
- Approximate using a simplified model: given the length of the protein, the generator selects the number of domains first and then selects the length of one domain at a time, considering the domains that were already generated
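One possible way to write the simplified generator is a chain-rule factorization: first the number of domains given the protein length, then each domain length conditioned on the protein length and on the lengths already generated. The exact expression from the slide is not preserved in this transcript, so the form below is a sketch of the verbal description and not necessarily the authors' formula.

```latex
P(D) \;\approx\; P(n \mid L)\, \prod_{i=1}^{n} P\bigl(l_i \mid L,\; l_1, \dots, l_{i-1}\bigr)
```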

The prior probabilities
- Approximate P_0(l_i | L) by P_0(l_i) normalized to the relevant range
- P_0(l_i | L) is derived based on experimental data

The prior probabilities (cont.)
- Calculate Prob(n | L) = Prob(n, L) / P(L)
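The ratio Prob(n, L) / P(L) can be estimated from counts in a set of proteins with known domain partitions. The sketch below assumes protein lengths are grouped into fixed-width bins so that each bin contains enough proteins; the binning, the `bin_size` value and the function name are assumptions and do not come from the talk.

```python
from collections import Counter


def domain_count_given_length(proteins, bin_size=50):
    """Estimate P(n | L) as count(n, length bin) / count(length bin).

    proteins: iterable of (length, number_of_domains) pairs.
    Returns a dict mapping (length_bin, n) to the estimated probability,
    where length_bin = length // bin_size.
    """
    joint = Counter()
    marginal = Counter()
    for length, n_domains in proteins:
        length_bin = length // bin_size
        joint[(length_bin, n_domains)] += 1
        marginal[length_bin] += 1
    return {key: count / marginal[key[0]] for key, count in joint.items()}


toy = [(120, 1), (130, 2), (140, 2), (400, 3)]  # made-up lengths and domain counts
print(domain_count_given_length(toy))
```

Dividing the joint count by the count of the length bin is the empirical version of Prob(n, L) / P(L), since the total number of proteins cancels.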

The likelihood
- Use probabilities of observed scores considering the two different sources
- The model D partitions the sequence S into n domains and n-1 transitions, D_1, T_1, D_2, T_2, …, T_{n-1}, D_n, that correspond to the subsequences s_1, t_1, s_2, t_2, …, t_{n-1}, s_n
- Assume domains are independent of each other (an additional test can be used)

…likelihood
- Each term P(s_i | D_i) and P(t_j | T_j) is a product over the probabilities of the individual positions; each one is estimated by the joint probability distribution of the 12 features
- How to estimate this probability? (The independence assumption does not hold)
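Under the stated assumption that segments are independent of each other, the likelihood factorizes over the domain and transition segments, and each segment term factorizes over its positions. In the sketch below, f_1(x), …, f_12(x) denote the 12 selected feature values at alignment position x; this notation is introduced here for illustration and does not appear in the talk.

```latex
P(S \mid D) \;=\; \prod_{i=1}^{n} P(s_i \mid D_i)\; \prod_{j=1}^{n-1} P(t_j \mid T_j),
\qquad
P(s_i \mid D_i) \;=\; \prod_{x \in s_i} P\bigl(f_1(x), \dots, f_{12}(x) \mid \text{domain}\bigr)
```

The transition terms P(t_j | T_j) have the same per-position form, with the joint feature distribution estimated from transition regions instead of domain regions.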


Likelihood of individual position
- Given k random variables X_1, X_2, …, X_k, their joint probability distribution
- Use first-order dependencies
- For each pair, calculate the distance between the joint probability distribution and the product of the marginal distributions

- Sort all pairs based on their dependency, pick the most dependent one (denoted by Y1, Y2) and start the expansion
- Select the next one based on the strongest dependency with variables that are already in the expansion

- Denote by Z = PILLAR(Y) the random variable that Y is most dependent on
- Of all possible dependencies involving Y3, pick P(Y3|Z) and add it to the expansion
- Proceed until you exhaust all variables
- Maximize support, minimize error
- The expansion is different for domain and transition regions
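The greedy construction of the first-order expansion can be sketched as follows. The sketch assumes the pairwise dependency strengths have already been computed (for example as a divergence between each pairwise joint distribution and the product of its marginals; the talk does not name the distance) and are supplied as a symmetric matrix. The function name is hypothetical, and the "maximize support, minimize error" criterion is not modelled here.

```python
def dependency_expansion(dep):
    """Order variables for a first-order expansion P(Y1) P(Y2|Y1) P(Y3|Z3) ...

    dep: symmetric k x k matrix (list of lists); dep[i][j] is the dependency
    strength between variables i and j (larger = more dependent).
    Returns a list of (variable, pillar) pairs, where pillar is the variable
    already in the expansion that the new one is conditioned on
    (None for the first variable).
    """
    k = len(dep)
    if k == 0:
        return []
    if k == 1:
        return [(0, None)]
    # Start from the most dependent pair (Y1, Y2).
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    y1, y2 = max(pairs, key=lambda p: dep[p[0]][p[1]])
    expansion = [(y1, None), (y2, y1)]
    used = [y1, y2]
    # Repeatedly add the unused variable with the strongest link to the expansion.
    while len(used) < k:
        links = [(y, z) for y in range(k) if y not in used for z in used]
        y, z = max(links, key=lambda p: dep[p[0]][p[1]])
        expansion.append((y, z))
        used.append(y)
    return expansion


toy_dep = [[0.0, 0.9, 0.1],
           [0.9, 0.0, 0.5],
           [0.1, 0.5, 0.0]]
print(dependency_expansion(toy_dep))
```

With a mutual-information style dependency this greedy procedure resembles building a maximum-weight spanning tree over the variables. Running it separately on statistics from domain and transition regions would give the two different expansions mentioned above.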

Finally
- Enumerate all possible hypotheses, calculate the posterior probability for each one, and output the one that maximizes the probability
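The exhaustive search can be sketched as an enumeration over all subsets of the candidate transition points. The scoring function is passed in as a parameter because the actual posterior depends on the prior and likelihood models described earlier; `best_partition` and `toy_score` are hypothetical names used only for this sketch.

```python
from itertools import combinations


def best_partition(candidates, log_posterior):
    """Score every subset of candidate transitions and return the best one.

    candidates: iterable of candidate transition positions.
    log_posterior: function mapping a tuple of sorted transition positions to
    a score, standing in for log P(S|D) + log P(D).
    The empty tuple is the single-domain hypothesis.
    """
    positions = sorted(candidates)
    best, best_score = (), log_posterior(())
    for size in range(1, len(positions) + 1):
        for subset in combinations(positions, size):
            score = log_posterior(subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


def toy_score(transitions):
    """Made-up score: prefers exactly one transition, and position 100."""
    return -abs(len(transitions) - 1) - (0 if 100 in transitions else 1)


print(best_partition([50, 100, 150], toy_score))
```

The number of hypotheses grows as 2^m for m candidates, so an exhaustive search of this kind is practical only because the refinement step leaves few candidate transitions per protein.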

Summary of results
- Distance accuracy: average distance of the predicted transitions from their associated SCOP transition points
- Distance sensitivity: average distance of SCOP transitions from their associated predicted transition points
- Selectivity: percentage of correct predictions (within 10 residues of SCOP transitions)
- Coverage: percentage of correctly identified SCOP transitions (within 10 residues of predicted transitions)
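The four measures can be sketched as follows. The sketch assumes that the "associated" transition is simply the nearest one and that "within 10 residues" is inclusive; both are assumptions. Selectivity and coverage are returned as fractions, not percentages, and the function names and toy positions are hypothetical.

```python
def mean_nearest_distance(points, targets):
    """Average distance from each point to its nearest target.

    With points = predicted and targets = SCOP this corresponds to distance
    accuracy; with the arguments swapped, to distance sensitivity.
    Both lists must be non-empty.
    """
    return sum(min(abs(p - t) for t in targets) for p in points) / len(points)


def selectivity(predicted, reference, tol=10):
    """Fraction of predicted transitions within tol residues of a reference one."""
    if not predicted:
        return 0.0
    hits = sum(1 for p in predicted if any(abs(p - r) <= tol for r in reference))
    return hits / len(predicted)


def coverage(predicted, reference, tol=10):
    """Fraction of reference transitions within tol residues of a predicted one."""
    if not reference:
        return 0.0
    hits = sum(1 for r in reference if any(abs(p - r) <= tol for p in predicted))
    return hits / len(reference)


predicted = [73, 160, 300]  # made-up predicted transition positions
scop = [78, 165]            # made-up reference transition positions
print(selectivity(predicted, scop), coverage(predicted, scop))
print(mean_nearest_distance(predicted, scop), mean_nearest_distance(scop, predicted))
```

Selectivity penalizes spurious predictions while coverage penalizes missed reference transitions, so the two need to be read together.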

Examples
PDB ID: 2gep
Domain Definition: 8-72, , ,
Predicted Domains: 1-75, , ,
Pfam Definition: 1-67, ,

Examples
PDB ID: 1b6s chain D
Domain Definition: 1-78, ,
Predicted Domains: 1-73, ,
Pfam Definition:

Examples
PDB ID: 1acc
Domain Definition:
Predicted Domains: 1-158, ,
Pfam Definition:

Conclusions
- A method for predicting the domain structure of a protein from sequence information alone
- Protein/DNA data, multiple features, optimization based on information theory principles, a learning system, and final prediction using the domain-generator model (with confidence values)
- Exhaustive hypothesis evaluation
- Fully automatic and fast
- Performs very well even compared to the best manual and semi-manual methods out there (also on CATH data)
- Dare to say … can be used to verify domain assignments based on structural data
- Improvements: other learning systems, more features

Acknowledgments
Niranjan Nagarajan, SCOP, CATH, PSI-BLAST, Pfam, InterPro, NSF