Detecting the Domain Structure of Proteins from Sequence Information
Niranjan Nagarajan and Golan Yona, Department of Computer Science, Cornell University


What’s and Why’s
Why?
- Function Prediction
- Improved Alignments and more accurate Evolutionary Studies
- Protein Design
What?
- Delineating Sequence-Contiguous Domains
- Working exclusively on Sequence Information

Past Work
- The Pfam Protein Families Database, Bateman et al. (2002) Nucleic Acids Research 30:
- ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons, Corpet et al. (2000) Nucleic Acids Research 28:
- Automated Protein database classification: I. Integration of compositional similarity search, local similarity search and multiple sequence alignment. II. Delineation of domain boundaries from sequence similarities, Jerome et al. (1998) Bioinformatics 14:

Overview of the Process
Seed Sequence -> BLAST search -> Multiple Alignment -> Scores (Correlation, Entropy, Sequence Participation, Contact Profile, Secondary Structure, Physico-Chemical Properties) -> Neural Network -> Final Predictions

Motivation
- Simple and Extensible
- Tests an array of novel sources of information
- Automated method based on statistical analysis of the scores
- Domain transition signals are learned rather than programmed in

Score Design
- Efficiently computable
- Yields a single value per profile column
- Robust to alignment inaccuracies
- Useful in distinguishing in-domain from out-domain columns, in isolation or in combination with other scores

Correlation
Measures the conservation of the alignment in a region
[Figure panels: High Correlation, Low Correlation]

Entropy
Estimates the diversity of the amino-acid distribution for a column
[Figure panels: Low Entropy, High Entropy]
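As a rough illustration of a per-column diversity score, the sketch below computes the Shannon entropy of the residues in each alignment column. The slide does not give the exact formula, so the choice of Shannon entropy in bits, the unweighted counting of sequences, and the handling of gaps (ignored) are assumptions, and the function name and toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (in bits) of the residues in one alignment column.

    Assumption: gaps are ignored and every sequence counts equally.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Toy alignment (hypothetical data): rows are sequences, columns are positions.
alignment = ["ACDA", "ACEA", "ACFG", "AC-T"]
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
entropies = [column_entropy(col) for col in columns]
```

A fully conserved column scores 0, and the score grows as the residues in the column become more varied.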

Sequence Participation
Identifies and quantifies the significance of regions where there is a major change in sequence participation
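One simple way to picture a change in sequence participation is to count, for each column, how many aligned sequences begin or end there. The sketch below does only that; it is not the statistical significance measure used in the talk, which the slide does not specify. The function name and the toy alignment are hypothetical.

```python
def participation_change(alignment, gap="-"):
    """Per column: fraction of sequences whose aligned span starts or ends there.

    Assumption: a sequence "participates" from its first to its last
    non-gap residue; spans touching the alignment ends are not counted.
    """
    n_cols = len(alignment[0])
    spans = []
    for seq in alignment:
        positions = [i for i, c in enumerate(seq) if c != gap]
        if positions:
            spans.append((positions[0], positions[-1]))
    if not spans:
        return [0.0] * n_cols
    change = [0] * n_cols
    for start, end in spans:
        if start > 0:
            change[start] += 1
        if end < n_cols - 1:
            change[end] += 1
    return [c / len(spans) for c in change]


# Toy alignment (hypothetical data): three sequences stop or start mid-alignment.
toy_alignment = [
    "ACDEFGHIKL",
    "ACDEF-----",
    "ACDE------",
    "-----GHIKL",
]
profile = participation_change(toy_alignment)
```

Columns where many sequences start or stop are candidate domain transitions, since homologs often cover only one domain of the seed.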

Secondary Structure
Uses PSIPRED secondary structure predictions for the seed sequence

Contact Profile
Contacts are predicted based on correlated mutation values that are significantly larger than random values

Physico-Chemical Properties
We tested properties such as hydrophobicity, molecular weight, and charge, as well as various classifications of the amino acids, for their information content.
Scores were calculated by:
- Using the classification to assign values in the range [0, 1] to every residue
- Taking the average of the values for a profile column
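The two-step calculation above can be sketched as follows. The binary hydrophobic grouping used here is only an illustrative assumption (the slide does not list the classifications tested), and averaging over raw sequences rather than over weighted profile frequencies is a simplification.

```python
# Illustrative grouping only; the classifications used in the talk are not given.
HYDROPHOBIC = set("AVLIMFWC")


def residue_value(residue):
    """Map a residue to a value in [0, 1] according to the classification."""
    return 1.0 if residue in HYDROPHOBIC else 0.0


def column_property_score(column, gap="-"):
    """Average the per-residue values over one alignment column (gaps ignored)."""
    values = [residue_value(c) for c in column if c != gap]
    return sum(values) / len(values) if values else 0.0


mixed = column_property_score("AVKD")   # two hydrophobic, two not
buried = column_property_score("LLLL")  # all hydrophobic
```

Any other property (charge, molecular weight scaled to [0, 1]) could be swapped in by replacing `residue_value`.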

Generating the Data Set
Seed Sequences:
- 4810 non-redundant (95% identity) PDB sequences that are at least 40 amino acids long (PDB data as of May 2002)
Alignments:
- The seeds were searched with BLAST against a composite non-redundant database with 693,912 non-fragmented entries
- The resulting hits were compiled into smaller databases
- The seeds were queried using PSI-BLAST (until convergence) against these smaller databases to generate the alignments
Domain Definitions:
- Definitions in SCOP 1.57 were used (seeds with inconsistent definitions or less than 90% coverage were removed)
- The final set, after filtering to ensure a balance in the number of single-domain (576) and multi-domain (605) proteins, contained 1181 seed proteins and their alignments

Massaging and Optimizing the Scores
- Scores were smoothed over various smoothing windows to test the importance of evening out local fluctuations
- Scores were normalized to ensure that values from different proteins were comparable
- The size of the smoothing window was optimized using the Jensen-Shannon Divergence between the distributions for in-domain and out-domain columns
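A minimal sketch of the window-selection idea: smooth a score with a moving average, build histograms of the smoothed values for in-domain and out-domain columns, and measure how well they separate with the Jensen-Shannon divergence. The moving-average smoother, the histogram binning, and all function names are assumptions; the slide does not say how the distributions were estimated, and normalization across proteins is omitted here.

```python
import math


def smooth(scores, window):
    """Centered moving average; the window is truncated at the sequence ends."""
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out


def histogram(values, bins, lo, hi):
    """Normalized histogram of values over [lo, hi]; values must be non-empty."""
    counts = [0] * bins
    for v in values:
        k = min(bins - 1, max(0, int((v - lo) / (hi - lo) * bins)))
        counts[k] += 1
    return [c / len(values) for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, so the result lies in [0, 1])."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def window_divergence(scores, in_domain, window, bins=10):
    """Separation of in-domain vs out-domain columns after smoothing."""
    smoothed = smooth(scores, window)
    lo, hi = min(smoothed), max(smoothed)
    if hi == lo:
        return 0.0
    inside = [s for s, d in zip(smoothed, in_domain) if d]
    outside = [s for s, d in zip(smoothed, in_domain) if not d]
    return js_divergence(histogram(inside, bins, lo, hi),
                         histogram(outside, bins, lo, hi))


# Hypothetical per-column scores and labels (True = in-domain column).
scores = [0.9, 0.8, 1.0, 0.7, 0.2, 0.1, 0.3, 0.9, 1.0, 0.8]
labels = [True, True, True, True, False, False, False, True, True, True]
best_window = max([1, 3, 5], key=lambda w: window_divergence(scores, labels, w))
```

A larger divergence means the smoothed score separates the two classes of columns better, so the window with the highest value would be preferred under this sketch.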

Designing and Training the Neural Network
- Matlab’s Neural Network Toolbox was used to design and train networks
- Network Properties: Feed-Forward Back Propagation network with Tangent Sigmoid activation function
- Current best network takes in 11 inputs and has two hidden layers with 10 and 5 neurons respectively
- Neural network trained on a set of 484 proteins with a validation set of 237 proteins and test set of 460 proteins
- Best network has accuracy of 91% for in-domain and 70% for out-domain columns in test set
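To make the architecture concrete, here is a forward pass through a feed-forward network with 11 inputs and hidden layers of 10 and 5 neurons, using the hyperbolic tangent as the activation. This is plain Python rather than the Matlab toolbox used in the talk; the single tanh output unit, the random initial weights, and the function names are assumptions, and back-propagation training is left out.

```python
import math
import random

# 11 inputs, hidden layers of 10 and 5 neurons; the single output is an assumption.
LAYER_SIZES = [11, 10, 5, 1]


def init_network(sizes, seed=0):
    """Return a list of (weights, biases) pairs with small random weights."""
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        network.append((weights, biases))
    return network


def forward(network, inputs):
    """Propagate one 11-value column feature vector through every layer."""
    activations = list(inputs)
    for weights, biases in network:
        activations = [
            math.tanh(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations


network = init_network(LAYER_SIZES)
column_features = [0.1 * i for i in range(11)]  # hypothetical normalized scores
output = forward(network, column_features)
```

With an untrained network the output is meaningless; the point is only the shape of the computation, where each column's 11 scores map to one value that could be thresholded into in-domain or out-domain.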

From Neural Network to Cutpoint Predictions
- A column is predicted as a cutpoint if a significant fraction of columns in a window centered at it are predicted as being out-domain
- For regions with multiple cutpoints near one another, minima of the smoothed prediction curve are used to decide the most suitable cutpoint
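The two rules above can be sketched as follows, assuming the network emits one score per column where higher means in-domain. The window size, the fraction that counts as "significant", the score threshold, and the distance used to group nearby candidates are all hypothetical parameters chosen for illustration.

```python
def moving_average(values, window):
    """Centered moving average; the window is truncated at the ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def predict_cutpoints(scores, window=5, min_fraction=0.6,
                      threshold=0.0, merge_distance=3):
    """Pick one cutpoint per cluster of candidate columns.

    scores: per-column network output, higher = in-domain (assumption).
    """
    half = window // 2
    out_domain = [s < threshold for s in scores]

    # Rule 1: enough out-domain columns in the window centered at i.
    candidates = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        if sum(out_domain[lo:hi]) / (hi - lo) >= min_fraction:
            candidates.append(i)

    # Group candidates that lie close to one another.
    groups = []
    for i in candidates:
        if groups and i - groups[-1][-1] <= merge_distance:
            groups[-1].append(i)
        else:
            groups.append([i])

    # Rule 2: within each group, keep the minimum of the smoothed curve.
    smoothed = moving_average(scores, window)
    return [min(group, key=lambda i: smoothed[i]) for group in groups]


# Hypothetical prediction curve with one dip centered at column 12.
curve = [1.0] * 10 + [-0.2, -0.6, -1.0, -0.6, -0.2] + [1.0] * 10
cutpoints = predict_cutpoints(curve)
```

Taking the minimum of the smoothed curve collapses a run of adjacent candidate columns into a single boundary estimate.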

Comparative Results
- Accuracy evaluates predictions with respect to the true definitions
- Sensitivity evaluates true definitions with respect to the predictions
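One plausible reading of these two measures: accuracy is the fraction of predicted cutpoints that fall near a true domain boundary, and sensitivity is the fraction of true boundaries that have a predicted cutpoint nearby. The sketch below implements that reading; the tolerance of 10 residues and the example positions are hypothetical, not values from the talk.

```python
def matched_fraction(points, references, tolerance):
    """Fraction of points lying within `tolerance` of some reference point."""
    if not points:
        return 0.0
    hits = sum(any(abs(p - r) <= tolerance for r in references) for p in points)
    return hits / len(points)


def accuracy(predicted, true_boundaries, tolerance=10):
    """Predictions evaluated against the true definitions."""
    return matched_fraction(predicted, true_boundaries, tolerance)


def sensitivity(predicted, true_boundaries, tolerance=10):
    """True definitions evaluated against the predictions."""
    return matched_fraction(true_boundaries, predicted, tolerance)


# Hypothetical positions, for illustration only.
predicted = [50, 120, 200]
true_boundaries = [55, 210]
acc = accuracy(predicted, true_boundaries)      # 2 of 3 predictions match
sens = sensitivity(predicted, true_boundaries)  # both true boundaries found
```

The two numbers trade off against each other: predicting more cutpoints tends to raise sensitivity and lower accuracy.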

Examples
Seed Number: 9847
PDB ID: 1b6s chain D
Domain Definition: 1-78, ,
Predicted Cutpoints: 73, 271
Pfam Definition:

More Examples
Seed Number:
PDB ID: 1acc
Domain Definition:
Predicted Cutpoints: 158, 583
Pfam Definition:

Highlights
- Correctly predicts domain definitions for 237 (52%) of the proteins in the test set, thus comparing favorably with Pfam (258, 56%)
- The procedure is simple and fast, and comparable in accuracy and coverage to Pfam
- General-purpose method for delineating domain boundaries that relies solely on sequence information