Carnegie Mellon School of Computer Science Copyright © 2003, Carnegie Mellon. All Rights Reserved. Biological Language Modeling Project TXTpred: A New.


TXTpred: A New Method for Protein Secondary Structure Prediction
Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman
School of Computer Science, Carnegie Mellon University
May 14, 2003

Roadmap
Overview of secondary structure prediction
Description of the TXTpred method
Experimental results and analysis
Discussion and further work

Secondary Structure of a Protein Sequence
The Dictionary of Secondary Structure of Proteins (DSSP) annotates each residue with its structure
  – based on hydrogen-bonding patterns and geometrical constraints
7 DSSP labels for protein secondary structure (PSS):
  – Helix types: H (alpha-helix), G (3/10 helix)
  – Sheet types: B (isolated beta-bridge), E (extended strand)
  – Coil types: T, _, S (coil)
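
The grouping of DSSP labels into helix, sheet and coil can be sketched as a simple lookup. The transcript does not state the exact reduction TXTpred used, so the mapping below is an assumption that follows the grouping listed on this slide; the names `DSSP_TO_THREE_STATE` and `reduce_dssp` are hypothetical.

```python
# Assumed reduction, following the grouping on the slide:
# H, G -> helix (H); B, E -> sheet (E); T, _, S -> coil (C).
DSSP_TO_THREE_STATE = {
    "H": "H", "G": "H",
    "B": "E", "E": "E",
    "T": "C", "_": "C", "S": "C",
}


def reduce_dssp(labels):
    """Map a string of 7-state DSSP labels to a 3-state H/E/C string."""
    # A label outside the seven listed ones raises KeyError on purpose,
    # so unexpected annotations are not silently turned into coil.
    return "".join(DSSP_TO_THREE_STATE[label] for label in labels)
```

For instance, `reduce_dssp("HGBET_S")` maps each of the seven labels to its three-state class.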

Secondary Structure of a Protein Sequence
Accuracy limit: ~88%

Task Definition
Given a protein sequence:
  – APAFSVSPASGA
Predict its secondary structure sequence:
  – CCEEEEECCCCC
  – Focus on soluble proteins, not membrane proteins

Overview of Previous Work - 1
1st-generation methods
  – Calculate propensities for each amino acid
    E.g. Chou-Fasman method (Chou & Fasman, 1974)
2nd-generation methods
  – "Window" concept: APAFSVSPAS (window size = 7)
  – Calculate propensities for segments of 3-51 amino acids
    E.g. GOR method (Garnier et al., 1978)

Overview of Previous Work - 2
3rd-generation methods
  – Use evolutionary information
    Multiple sequence alignment
    p-value cut-off =
    PHD: neural network, sequence features only (Rost & Sander, 1993)
    DSC: LDA with biological features such as GOR and hydrophobicity (King & Sternberg, 1996)
  – Later refinements
    Apply divergent sequence alignment, e.g. PROF (Ouali & King, 2000)
    Combine the results of different systems, e.g. Jpred (Cuff & Barton, 1999)
    Bayesian segmentation (Schmidler et al., 1999)

Summary of Performance
Method name    Performance (Q3)
Chou-Fasman    ~50%
GOR            ~56%
PHD            ~71%
DSC            ~70%

Disadvantages of Previous Work
Most are "black box" predictors
  – Weak biological meaning
Little focus on long-range interactions
  – Mostly focused on local information
Performance is asymptotically bounded

Roadmap
Overview of secondary structure prediction
Description of the TXTpred method
Experimental results and analysis
Discussion and further work

TXTpred
Basic idea:
  – Build a meaningful biological vocabulary
  – Apply language techniques for prediction
Major challenge:
  – How to build the vocabulary?
Context-free N-grams of amino acids inside the window
  – Sequence: APAFSVSPAS (window = 7)
  – N-grams: P, A, ..., P, PA, AF, ..., SP, PAF, AFS, ..., VSP
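
A minimal sketch of how the context-free N-grams in the example could be enumerated. It assumes the window is centred on the residue being predicted (the central S of APAFSVSPAS) and is simply truncated at the sequence ends; the function names `window_around` and `context_free_ngrams` are hypothetical, not taken from the slides.

```python
def window_around(sequence, center, size):
    """Return the residues in a window of odd `size` centred on index `center`.

    The window is truncated at the sequence ends (an assumption; the
    slides do not say how boundaries are handled).
    """
    half = size // 2
    start = max(0, center - half)
    return sequence[start:center + half + 1]


def context_free_ngrams(window, max_n=3):
    """List all 1-grams, then 2-grams, ... up to `max_n`-grams of the window."""
    grams = []
    for n in range(1, max_n + 1):
        grams.extend(window[i:i + n] for i in range(len(window) - n + 1))
    return grams


window = window_around("APAFSVSPAS", center=4, size=7)  # "PAFSVSP"
grams = context_free_ngrams(window)
```

Here `grams` starts with the one-grams P, A, F, S, V, S, P, continues with the two-grams PA through SP, and ends with the three-grams PAF through VSP, in the same order as the slide's example.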

Biological Vocabulary
Context-sensitive vocabulary
  – Analogy
    The same word can have different meanings, e.g. "bank"
    The same amino acid can have different properties: APAFSVSPAS
  – Encode context semantics into the N-gram
    Record the position information in the N-gram
    Example: APAFSVSPAS (window size = 7)
  – Words: P-3, A-2, F-1, S+0, V+1, S+2, P+3
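
The position-tagged words can be generated by attaching each residue's signed offset from the window centre. This is a sketch under the assumption that offsets are counted from the central residue, so the three residues to its right get +1, +2 and +3; `positional_words` is a hypothetical name.

```python
def positional_words(sequence, center, size):
    """Return one-gram words tagged with their offset from the window centre."""
    half = size // 2
    words = []
    for offset in range(-half, half + 1):
        index = center + offset
        # Skip offsets that fall outside the sequence (boundary handling
        # is an assumption; the slides do not specify it).
        if 0 <= index < len(sequence):
            words.append(f"{sequence[index]}{offset:+d}")
    return words


words = positional_words("APAFSVSPAS", center=4, size=7)
```

The same residue type then yields different words at different offsets: the two S residues in the window become `S+0` and `S+2`.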

Text Classification
Text classification
  – Analogy
    The topic of a document is expressed by the words of the document
    The structure of a residue can be inferred from the biological words nearby
  – High accuracy
  – Text classification technique
    Doc to vectors:
    Classifiers: Support Vector Machines
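
To illustrate the "doc to vectors" step, the sketch below treats the words around one residue as a document and turns it into a count vector over a fixed vocabulary. The transcript does not say which term weighting was used, so raw counts are an assumption, and the helper names are hypothetical. The resulting vectors are the kind of input a Support Vector Machine classifier would be trained on.

```python
from collections import Counter


def build_vocabulary(documents):
    """Assign a column index to every distinct word, in sorted order."""
    vocabulary = sorted({word for document in documents for word in document})
    return {word: index for index, word in enumerate(vocabulary)}


def to_vector(document, vocabulary):
    """Turn a list of words into a raw count vector (assumed weighting)."""
    counts = Counter(word for word in document if word in vocabulary)
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        vector[vocabulary[word]] = count
    return vector


# Two toy "documents": the position-tagged words around two residues.
documents = [["P-1", "A+0", "F+1"], ["A-1", "F+0", "S+1"]]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(document, vocabulary) for document in documents]
```

Words that are not in the vocabulary are ignored, which is also how a feature-selected vocabulary of limited size would behave.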

TXTpred Method
Settings:
  Window = 17
  One-gram, two-gram
  Feature number = 3000

Evaluation Measures
Q3 (accuracy)
Precision, recall
Segment overlap quantity (SOV)
Matthews correlation coefficient, from the contingency table of true (T) versus predicted (P) labels:
        P+   P-
  T+    p    u
  T-    o    n
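
Two of these measures are short enough to sketch directly. Q3 is the fraction of residues whose three-state label is predicted correctly. The Matthews correlation coefficient for one state is computed from four counts: p (residues correctly predicted in the state), u (residues in the state that were missed), o (residues wrongly predicted in the state) and n (residues correctly predicted as not in the state). The function names are hypothetical, and returning 0.0 when the denominator vanishes is a convention chosen here, not something stated in the slides.

```python
import math


def q3(true_labels, predicted_labels):
    """Fraction of residues whose predicted state matches the true state."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def matthews(true_labels, predicted_labels, state):
    """Matthews correlation coefficient for a single state such as 'E'."""
    p = u = o = n = 0
    for true, predicted in zip(true_labels, predicted_labels):
        if true == state and predicted == state:
            p += 1  # correctly predicted in the state
        elif true == state:
            u += 1  # under-predicted (missed)
        elif predicted == state:
            o += 1  # over-predicted
        else:
            n += 1  # correctly predicted as not in the state
    denominator = math.sqrt((p + u) * (p + o) * (n + u) * (n + o))
    # Convention assumed here: report 0.0 when the coefficient is undefined.
    return (p * n - u * o) / denominator if denominator else 0.0


truth = "CCEEEEECCCCC"
guess = "CCEEEECCCCCC"  # one strand residue predicted as coil
accuracy = q3(truth, guess)
strand_mcc = matthews(truth, guess, "E")
```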

Experimental Results
RS126 dataset
CB513 dataset

Biological Language Properties
Power law?
[Figure: Term Frequency = f(Rank); panels: one-gram, two-gram]
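
A rank-frequency curve of this kind can be computed by counting words, sorting the counts in decreasing order and, as a rough power-law check, fitting a straight line in log-log space. The sketch below is an illustration under those assumptions and not the analysis used for the slide; `rank_frequency` and `loglog_slope` are hypothetical names.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent word."""
    ordered = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(ordered, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct ranks. A slope near a negative constant
    over a wide rank range is what a power law would look like.
    """
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(frequency) for _, frequency in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


pairs = rank_frequency(list("AAABBC"))
```

The same two functions apply unchanged to one-grams and two-grams; only the list of words passed in differs.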

Sequence Analysis - 1: Feature Selection
Top ten discriminating features for helix
Verification by Chou-Fasman parameters
  – Helix favors A, E, M, L, K (top 5 amino acids)
  – Helix disfavors P (top 1 amino acid)

Sequence Analysis - 1: Feature Selection
Top ten discriminating features for sheet
Verification by Chou-Fasman parameters
  – Sheet favors V, I, Y, F, W (top 5 amino acids)
  – Sheet disfavors D, E (top 2 amino acids)

Sequence Analysis - 1: Feature Selection
Top ten discriminating features for coil
Verification by Chou-Fasman parameters
  – Coil favors N, P, G, D, S (top 5 amino acids)
  – Coil disfavors V, I, L (top 3 amino acids)

Sequence Analysis - 2: Word Correlation
Word correlation
  – Some words are strongly correlated and co-occur frequently
  – Technique: Singular Value Decomposition
Examples from text
  – Phrases: {president, Bush}
  – Semantically correlated: {Olympic, sports}

Sequence Analysis - 2: Word Correlation
Top ten correlated word pairs

Sequence Analysis - 2: Word Correlation
Regular expression   Protein sequence   Secondary structure   Conjecture
CPXXAI               Sq1: ECPNEAIM      L1: HCCCCCEC          Coil connected to sheet
                     Sq2: ECPAEAIK      L2: HCCCCCEE
                     Sq3: GCPIPAIL      L3: CCCCCEEE
PGH                  Sq1: TFPGHSA       L1: CCCCCCC           Coil
                     Sq2: DCPGHAD       L2: ECCCHHH
EEL                  Sq1: DDEELLE       L1: CCHHHHH           Helix
                     Sq2: WSEELNS       L2: CCHHHHH
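
Patterns such as CPXXAI can be checked against sequences with an ordinary regular expression. The sketch assumes that X in these patterns is a wildcard for any single residue, which is how the listed sequences match; `find_motif` is a hypothetical helper built on Python's standard `re` module.

```python
import re


def find_motif(pattern, sequence):
    """Return the start index of every non-overlapping match of the pattern.

    X in `pattern` is treated as "any one residue" (an assumption based on
    the example sequences).
    """
    regex = re.compile(pattern.replace("X", "."))
    return [match.start() for match in regex.finditer(sequence)]


hits = {
    sequence: find_motif("CPXXAI", sequence)
    for sequence in ("ECPNEAIM", "ECPAEAIK", "GCPIPAIL")
}
```

All three example sequences contain the pattern starting at their second residue. Slicing the structure labels at the same positions would show which states the matched residues take.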

Conclusion
TXTpred summary
  – Context-sensitive biological vocabulary
  – Novel application of text classification to secondary structure prediction
  – Comparable performance for secondary structure prediction
  – Analysis provides reasonable biological meanings and structure indicators

Future Work
Deeper study on extracting a more meaningful biological vocabulary
Further discovery of new features, such as torsion angles and free energy
Advanced learning models that consider long-range interactions
  – Conditional random fields, maximum entropy Markov models

Acknowledgement
Vanathi Gopalakrishnan, UPitt
Ivet Barhar, UPitt

Motivation for 2-D Prediction
Basis for three-dimensional structure prediction
Improving other sequence and structure analyses
  – Sequence alignment
  – Threading and homology modeling
  – Experimental data
  – Protein design