Remote homology detection


1 Remote homology detection
• Remote homologs: low sequence similarity, conserved structure/function
• A number of databases and tools are available
  - BLAST, FASTA
  - PDB
  - HOMSTRAD
  - SCOP
• Efficient methods are still needed for detecting proteins with similar function and structure

2 SCOP Database
• SCOP: Structural Classification of Proteins
  - Class Level
  - Fold Level
  - Superfamily Level
  - Family Level

3 SCOP Database
• SCOP: Structural Classification of Proteins
  - Class Level: based on arrangement of secondary structures
    - all-alpha
    - all-beta
    - alpha-and-beta (interspersed)
    - alpha+beta (segregated)
    - multidomain

4 SCOP Database
• SCOP: Structural Classification of Proteins
  - Class Level
  - Fold Level: same secondary structures, arrangements, topology

5 SCOP Database
• SCOP: Structural Classification of Proteins
  - Class Level
  - Fold Level
  - Superfamily Level: structure and function suggest common evolutionary origin

6 SCOP Database
• SCOP: Structural Classification of Proteins
  - Class Level
  - Fold Level
  - Superfamily Level
  - Family Level: > 30% sequence identity or similar structure/function
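The four levels can be pictured as a path from class down to family. The sketch below is only an illustration: it assumes each protein carries a dotted label of the form class.fold.superfamily.family (the labels such as "a.1.1.2" are made-up examples, and the helper names are hypothetical). Treating "same superfamily, different family" as a remote-homolog pair is one common reading of the slides, not something they state explicitly.

```python
from typing import NamedTuple

class ScopLabel(NamedTuple):
    """One position in the hierarchy: class, fold, superfamily, family."""
    cls: str
    fold: int
    superfamily: int
    family: int

def parse_label(text: str) -> ScopLabel:
    """Parse an assumed dotted label such as 'a.1.1.2'."""
    cls, fold, superfamily, family = text.split(".")
    return ScopLabel(cls, int(fold), int(superfamily), int(family))

def shared_level(a: ScopLabel, b: ScopLabel) -> str:
    """Return the deepest level at which two labels still agree."""
    deepest = "none"
    for name, x, y in zip(("class", "fold", "superfamily", "family"), a, b):
        if x != y:
            break
        deepest = name
    return deepest

def is_remote_homolog_pair(a: ScopLabel, b: ScopLabel) -> bool:
    """Assumption: same superfamily but different family."""
    return shared_level(a, b) == "superfamily"
```

For example, two proteins labeled "a.1.1.2" and "a.1.1.3" agree down to the superfamily level and would count as a remote-homolog pair under this assumption.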

7 SCOP Database
• Another representation
[Figure: protein, family, superfamily]

8 Classification problem
• Given a query protein, identify functionally similar proteins from a database of known proteins

9 Classification problem
• Given a query protein, identify functionally similar proteins from a database of known proteins
• State-of-the-art methods employ Support Vector Machines (SVM)
  - Input: set of labeled data points (positive or negative)
  - Output: model that correctly classifies both the original input data and new unseen data points
• SVM finds a hyper-plane that separates the input data
• The new points are classified with respect to the hyper-plane
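To make the hyper-plane idea concrete, here is a minimal sketch of a linear soft-margin SVM trained by stochastic sub-gradient descent on the hinge loss. It is not the method used in the cited work (which relies on kernels and dedicated solvers); the function names, the toy 2-D points, and the settings lam, eta and epochs are all assumptions chosen for illustration.

```python
import random

def train_linear_svm(points, labels, lam=0.01, eta=0.01, epochs=200, seed=0):
    """Fit w, b by stochastic sub-gradient descent on the regularised hinge loss.

    points: list of equal-length feature vectors; labels: +1 or -1.
    lam, eta, epochs and seed are illustrative settings, not tuned values.
    """
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Regularisation term: shrink w slightly on every step.
            w = [(1.0 - eta * lam) * wj for wj in w]
            if margin < 1.0:
                # Point is inside the margin or misclassified: move the hyper-plane.
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def classify(w, b, x):
    """Return +1 or -1 according to the side of the hyper-plane w.x + b = 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy 2-D data: positives in the upper right, negatives in the lower left.
POINTS = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0), (-2.0, -2.0), (-3.0, -1.0), (-2.0, -3.0)]
LABELS = [1, 1, 1, -1, -1, -1]
```

Calling train_linear_svm(POINTS, LABELS) returns the weights and offset of a separating hyper-plane, and classify applies it to a new, unseen point.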

10 Support vector machines (SVM)

11 SVM and Data representation
• Each data point has to be represented as an n-dimensional vector
  - this is called the feature vector representation of the data
  - it encodes information about properties of the data
• Domain knowledge can/should be used to choose an appropriate feature representation
• Building an SVM-based classifier
[Diagram: Input Data, Feature Representation, SVM Training, SVM-based Classifier, Unseen Data]

12 Outline
• Related work
  - article classification
  - protein classification using sequence information
• Proposed method
  - protein classification using structure information
• Common thread
  - vocabulary: a set of possible features
  - feature vector: counts the number of times each feature occurs

13 Article classification
• Categorizing Reuters articles (Joachims, 98)
• Feature representation of articles
  - vocabulary is the set of all English words
  - feature vector represents the count of each word in the article

Example article:
Fat doses of red wine extract help obese mice stay healthy. A daily glass of red wine was linked to beneficial health effects a decade ago. Long suspected of playing a role in the "French paradox" (a high-fat diet with no ill effects on longevity), resveratrol is found in red wine, sadly in doses about 300 times lower than in the mouse study.

Feature vector (count, word):
  0 computer
  2 dose
  1 diet
  0 felony
  ....
  2 health
  0 insurance
  0 liquor
  2 mouse
  ....
  1 obese
  1 paradox
  3 red
  3 wine
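The word-count representation can be sketched in a few lines. This is an assumption-laden illustration rather than the pipeline from the cited work: the function name and the short vocabulary are hypothetical, and the sketch matches exact lowercase words only, whereas the slide's counts (for example "dose" and "mouse") suggest that word forms were merged by stemming.

```python
import re
from collections import Counter

def word_count_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text (exact match)."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # One entry per vocabulary word, in vocabulary order; absent words give 0.
    return [counts[word] for word in vocabulary]

# Hypothetical mini-vocabulary and an excerpt of the article on the slide.
VOCABULARY = ["computer", "obese", "red", "wine"]
EXCERPT = (
    "Fat doses of red wine extract help obese mice stay healthy. "
    "A daily glass of red wine was linked to beneficial health effects a decade ago."
)
```

With a vocabulary covering all English words the vector is very long and mostly zeros, so a sparse mapping from word to count is the usual practical choice.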

14 Protein classification (sequence)
• Categorizing proteins using sequence information (Leslie et al., 04)
• Feature representation of proteins
  - vocabulary is all k-letter words from the amino acid alphabet
  - feature vector represents the count of each “word” in the protein

Example sequence: LVLHSEGWAKVQLVLHVWAKVE.....

Feature vector (count, word):
  0 AAAA
  0 AAAC
  0 AAAD
  0 AAAE
  ....
  2 LVLH
  0 LVLI
  0 LVLK
  ....
  0 WAKS
  0 WAKT
  2 WAKV
  ....
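Counting k-letter words amounts to sliding a window of length k along the sequence. The sketch below assumes overlapping windows and k = 4 (matching the 4-letter words on the slide); the function name is hypothetical, and the sequence is the visible fragment from the slide, which is truncated there.

```python
from collections import Counter

def kmer_counts(sequence, k=4):
    """Count every overlapping k-letter word in the sequence.

    Returns a sparse mapping: words that never occur are simply absent,
    and looking them up in the Counter gives 0.
    """
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))

# Visible part of the example sequence shown on the slide.
FRAGMENT = "LVLHSEGWAKVQLVLHVWAKVE"
COUNTS = kmer_counts(FRAGMENT)
```

On this fragment, COUNTS["LVLH"] and COUNTS["WAKV"] are both 2 and COUNTS["AAAA"] is 0, in line with the counts listed on the slide. A sparse mapping is used because the full vocabulary for k = 4 has 20^4 = 160,000 words.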

15 Protein classification (structure)
• Categorizing proteins using structure information (Ilinkin, Ye, in progress)
• Feature representation of proteins
  - vocabulary is all pairwise distances of k consecutive amino acids
  - feature vector represents the count of each “word” in the protein
• D = D(i, j) = distance between amino acids i and j

Example “words” (pairwise distances):
  (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
  (3.4, 2.8, 6.4, 3.7, 5.8, 3.1)
  (3.6, 4.9, 4.8, 3.5, 2.1, 3.5)
  (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
  (3.1, 2.2, 7.0, 3.7, 4.3, 3.6)
  (3.7, 5.8, 2.8, 3.1, 2.2, 3.7)
  (3.8, 6.5, 4.1, 3.4, 2.8, 3.7)
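A structural "word" can be sketched as the tuple of all pairwise distances within a window of k consecutive residues; with k = 4 that gives 6 distances, which matches the 6-tuples on the slide. Everything else here is an assumption: the slides do not say which atom represents a residue or how real-valued distances are turned into countable words, so this sketch takes one 3-D point per residue and bins each distance with a hypothetical bin_width.

```python
import math
from collections import Counter
from itertools import combinations

def distance_words(coords, k=4, bin_width=1.0):
    """Count binned pairwise-distance tuples over windows of k consecutive residues.

    coords: one (x, y, z) point per residue (assumed representation).
    bin_width: assumed discretisation step; not specified in the slides.
    """
    words = Counter()
    for start in range(len(coords) - k + 1):
        window = coords[start:start + k]
        # Pairs are visited in a fixed order: (0,1), (0,2), ..., (k-2,k-1).
        word = tuple(
            int(math.dist(p, q) // bin_width) for p, q in combinations(window, 2)
        )
        words[word] += 1
    return words

# Hypothetical coordinates: five residues on a straight line, 3.8 apart.
LINE = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
        (11.4, 0.0, 0.0), (15.2, 0.0, 0.0)]
```

The choice of bin_width trades resolution against sparsity: narrow bins make almost every word unique, while wide bins merge distinct local shapes into the same word.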

16 Experimental setup
• Given a query protein, can we predict its superfamily (in or out)?
• Split the data into positive (in) and negative (out) examples
• Reserve some of the data for testing; the rest is for training the SVM
[Diagram: positive (+) and negative (–) examples divided into train and test sets; Feature Vectors and SVM Training produce the Classifier]
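The setup can be sketched as two small steps: label each protein as in (+1) or out (-1) of a target superfamily, then hold some examples back for testing. The slides do not say how the test set was chosen, so the random, per-class split below is an assumption, as are the function names, the test_fraction value, and the placeholder protein and superfamily names.

```python
import random

def label_in_out(proteins, target_superfamily):
    """Turn (protein_id, superfamily) pairs into (protein_id, +1 or -1) pairs."""
    return [(pid, 1 if sf == target_superfamily else -1) for pid, sf in proteins]

def split_train_test(labeled, test_fraction=0.25, seed=0):
    """Randomly reserve a fraction of each class for testing (assumed scheme)."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (1, -1):
        group = [item for item in labeled if item[1] == label]
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Placeholder data: 4 proteins in superfamily "sfA", 8 in "sfB".
PROTEINS = [("p%d" % i, "sfA" if i < 4 else "sfB") for i in range(12)]
LABELED = label_in_out(PROTEINS, "sfA")
TRAIN, TEST = split_train_test(LABELED)
```

Splitting each class separately keeps both positive and negative examples in the test set even when positives are rare, as they are in the diagram.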

17 Results
• ROC curve plots true positive rate vs false positive rate
• Area under ROC curve (ROC score) is a measure of the quality of classification
  - area is between 0 and 1; closer to 1 is better
[Figures: Sample ROC Curve (axes: false positive, true positive); Experimental Results (Area under ROC)]
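The ROC score can be computed directly from classifier scores and true labels. The sketch below is a plain trapezoid-rule implementation with hypothetical function names; it assumes labels are +1 or -1, that higher scores mean "more likely positive", and that both classes are present (otherwise a rate would divide by zero).

```python
def roc_points(scores, labels):
    """Return the ROC curve as (false positive rate, true positive rate) points."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Lower the threshold past all examples that share the same score.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1] == 1:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / n_neg, tp / n_pos))
        i = j
    return points

def roc_score(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area
```

A classifier that ranks every positive above every negative gets a score of 1, and one that ranks them exactly the wrong way round gets 0.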

