Remote homology detection


Remote homology detection
- Remote homologs: low sequence similarity, conserved structure/function
- A number of databases and tools are available:
  - BLAST, FASTA
  - PDB
  - HOMSTRAD
  - SCOP
- Efficient methods are still needed for detecting proteins with similar function and structure

SCOP Database
- SCOP: Structural Classification of Proteins
- Class level: based on the arrangement of secondary structures
  - all-alpha
  - all-beta
  - alpha-and-beta (interspersed)
  - alpha+beta (segregated)
  - multidomain
- Fold level: same secondary structures, arrangement, and topology
- Superfamily level: structure and function suggest a common evolutionary origin
- Family level: > 30% sequence identity, or similar structure/function

SCOP Database
- Another representation
[Figure: protein, family, superfamily]

Classification problem
- Given a query protein, identify functionally similar proteins from a database of known proteins
- State-of-the-art methods employ Support Vector Machines (SVMs)
  - Input: a set of labeled data points (positive or negative)
  - Output: a model that correctly classifies both the original input data and new, unseen data points
  - The SVM finds a hyperplane that separates the input data
  - New points are classified with respect to the hyperplane
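A minimal sketch of the hyperplane idea above, assuming 2-D points with labels +1 and -1. The training loop is a plain subgradient descent on the regularized hinge loss, chosen only because it fits in a few lines; it is not the solver used in the work described here. The names train_linear_svm and classify, the toy points, and the parameters epochs, lr, and reg are all illustrative assumptions.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit a hyperplane w.x + b = 0 by subgradient descent on the hinge loss.

    Simplified stand-in for a real SVM solver (assumption, not from the slides).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w a little on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:
                # Point is inside the margin or misclassified: push the plane away.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy, linearly separable data (hypothetical values).
train_points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
train_labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(train_points, train_labels)
print(classify(w, b, (4, 4)), classify(w, b, (-4, -4)))
```

In practice the feature vectors have far more than two dimensions, and a library SVM implementation would replace the training loop.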

Support vector machines (SVM) ?

SVM and data representation
- Each data point has to be represented as an n-dimensional vector
  - this is called the feature vector representation of the data
  - it encodes information about properties of the data
- Domain knowledge can and should be used to choose an appropriate feature representation
- Building an SVM-based classifier
[Figure: Input Data, Feature Representation, SVM Training, SVM-based Classifier; Unseen Data is passed to the classifier]

Outline
- Related work
  - article classification
  - protein classification using sequence information
- Proposed method
  - protein classification using structure information
- Common thread
  - vocabulary: a set of possible features
  - feature vector: counts the number of times each feature occurs

Article classification
- Categorizing Reuters articles (Joachims, 98)
- Feature representation of articles
  - the vocabulary is the set of all English words
  - the feature vector represents the count of each word in the article

Example article:
  Fat doses of red wine extract help obese mice stay healthy
  A daily glass of red wine was linked to beneficial health effects a decade ago. Long suspected of playing a role in the "French paradox" (a high-fat diet with no ill effects on longevity), resveratrol is found in red wine, sadly in doses about 300 times lower than in the mouse study.

Example feature vector (count, word):
  0  computer
  2  dose
  1  diet
  0  felony
     health
  0  insurance
  0  liquor
  2  mouse
     obese
  1  paradox
  3  red
  3  wine
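The word-count representation can be sketched as follows. This is an illustration, not the cited method: it counts exact lowercase tokens only, so "doses" and "dose" or "mice" and "mouse" are treated as different words, whereas the counts on the slide appear to merge such forms. The function name word_counts and the small vocabulary are assumptions.

```python
import re
from collections import Counter


def word_counts(text, vocabulary):
    """Return the count of each vocabulary word in text (exact token match)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]


article = (
    "Fat doses of red wine extract help obese mice stay healthy. "
    "A daily glass of red wine was linked to beneficial health effects "
    'a decade ago. Long suspected of playing a role in the "French paradox" '
    "(a high-fat diet with no ill effects on longevity), resveratrol is "
    "found in red wine, sadly in doses about 300 times lower than in the "
    "mouse study."
)

# A tiny hypothetical vocabulary; the real one is all English words.
vocabulary = ["computer", "diet", "paradox", "red", "wine"]
features = word_counts(article, vocabulary)
print(dict(zip(vocabulary, features)))
```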

Protein classification (sequence)
- Categorizing proteins using sequence information (Leslie et al., 04)
- Feature representation of proteins
  - the vocabulary is all k-letter words from the amino acid alphabet
  - the feature vector represents the count of each "word" in the protein

Example sequence:
  LVLHSEGWAKVQLVLHVWAKVE.....

Example feature vector (count, word):
  0  AAAA
  0  AAAC
  0  AAAD
  0  AAAE
  ...
     LVLH
  0  LVLI
  0  LVLK
  ...
     WAKS
  0  WAKT
  2  WAKV
  ....
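Counting k-letter words can be sketched with a sliding window. This is a simplified illustration that counts exact matches only, run on the sequence fragment shown on the slide with k = 4 (the value of k is inferred from the four-letter words in the example). The function name kmer_counts is an assumption.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-letter word in an amino acid sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Only the fragment visible on the slide; the full protein is longer.
fragment = "LVLHSEGWAKVQLVLHVWAKVE"
counts = kmer_counts(fragment, 4)

# Words that never occur, such as AAAA, get a count of 0.
print(counts["WAKV"], counts["WAKT"], counts["AAAA"])
```

With a 20-letter alphabet and k = 4 the full vocabulary has 20^4 = 160,000 words, so a sparse structure such as the Counter above is a natural fit: only words that occur are stored.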

Protein classification (structure)
- D = D(i, j) = distance between amino acids i and j
- Categorizing proteins using structure information (Ilinkin, Ye, in progress)
- Feature representation of proteins
  - the vocabulary is all pairwise distances of k consecutive amino acids
  - the feature vector represents the count of each "word" in the protein

Example "words":
  (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
  (3.4, 2.8, 6.4, 3.7, 5.8, 3.1)
  (3.6, 4.9, 4.8, 3.5, 2.1, 3.5)
  (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
  (3.1, 2.2, 7.0, 3.7, 4.3, 3.6)
  (3.7, 5.8, 2.8, 3.1, 2.2, 3.7)
  (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
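A hedged sketch of how such distance "words" might be built. It assumes each amino acid is reduced to one 3-D coordinate and that distances are rounded to one decimal place so that identical words can be counted; neither detail is stated in the source. With k = 4 a window yields 6 pairwise distances, which would match the six-number tuples shown above, though that is an inference. The function name distance_words and the coordinates are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def distance_words(coords, k, precision=1):
    """Count the pairwise-distance tuple of every window of k consecutive points."""
    words = Counter()
    for start in range(len(coords) - k + 1):
        window = coords[start:start + k]
        # Pairs come in the order (0,1), (0,2), ..., (k-2,k-1).
        word = tuple(
            round(math.dist(p, q), precision) for p, q in combinations(window, 2)
        )
        words[word] += 1
    return words


# Toy coordinates: five points on a straight line, one unit apart.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
          (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
words = distance_words(coords, 4)
print(words)
```

Rounding is one simple way to make real-valued distances countable; a coarser binning would make more windows share a word.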

Experimental setup
- Given a query protein, can we predict its superfamily (in or out)?
- Split the data into positive (in) and negative (out) examples
- Reserve some of the data for testing; the rest is for training the SVM
[Figure: train examples go through Feature Vectors and SVM Training to produce a Classifier, which labels test examples + or -]
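The labeling and hold-out steps can be sketched as below. The source does not say how the test data is chosen, so holding out every third example is purely an assumption for illustration, as are the function name label_and_split, the protein identifiers, and the superfamily names.

```python
def label_and_split(proteins, target_superfamily, test_every=3):
    """Label proteins +1 (in) or -1 (out) and hold out every n-th one for testing."""
    labeled = [
        (protein_id, 1 if superfamily == target_superfamily else -1)
        for protein_id, superfamily in proteins
    ]
    test = [item for i, item in enumerate(labeled) if i % test_every == 0]
    train = [item for i, item in enumerate(labeled) if i % test_every != 0]
    return train, test


# Hypothetical (protein id, superfamily) pairs.
proteins = [
    ("p0", "sf_A"), ("p1", "sf_B"), ("p2", "sf_A"), ("p3", "sf_B"),
    ("p4", "sf_A"), ("p5", "sf_B"), ("p6", "sf_A"), ("p7", "sf_B"),
]
train, test = label_and_split(proteins, "sf_A")
print(test)
```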

Results
- The ROC curve plots the true positive rate vs. the false positive rate
- The area under the ROC curve (ROC score) is a measure of the quality of classification
  - the area is between 0 and 1; closer to 1 is better
[Figure: sample ROC curve, true positive rate vs. false positive rate]
[Figure: experimental results, area under ROC]
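The ROC score can be computed without drawing the curve, using the fact that the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive example receives the higher classifier score, with ties counted as one half. The sketch below uses made-up scores; the function name roc_auc is an assumption.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y != 1]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores and true labels (+1 in, -1 out).
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, -1, 1, -1]
print(roc_auc(scores, labels))
```

A classifier that ranks every positive above every negative gets a score of 1, and random scoring gives about 0.5.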