
1 MACHINE LEARNING FOR PROTEIN CLASSIFICATION: KERNEL METHODS. CS 374, Rajesh Ranganath, 4/10/2008

2 OUTLINE: Biological motivation and background; algorithmic concepts; mismatch kernels; semi-supervised methods

3 PROTEINS

4 THE PROTEIN PROBLEM: Primary structure can be determined easily, and 3D structure determines function, but grouping proteins into structural and evolutionary families is difficult. Use machine learning to group proteins.

5 HOW TO LOOK AT AMINO ACID CHAINS: the Smith-Waterman idea; the mismatch idea

6 FAMILIES: Proteins whose evolutionary relationship is readily recognizable from the sequence (>~25% sequence identity). Families are further subdivided into proteins, and proteins are divided into species; the same protein may be found in several species. [Hierarchy diagram: Fold > Superfamily > Family > Proteins. Morten Nielsen, CBS, BioCentrum, DTU]

7 SUPERFAMILIES: Proteins which are (remotely) evolutionarily related. Sequence similarity is low, but members share function and special structural features. Relationships between members of a superfamily may not be readily recognizable from the sequence alone. [Hierarchy diagram: Fold > Superfamily > Family > Proteins. Morten Nielsen, CBS, BioCentrum, DTU]

8 FOLDS: Proteins which have >~50% of their secondary structure elements arranged in the same order in the protein chain and in three dimensions are classified as having the same fold. No evolutionary relation between the proteins is implied. [Hierarchy diagram: Fold > Superfamily > Family > Proteins. Morten Nielsen, CBS, BioCentrum, DTU]

9 PROTEIN CLASSIFICATION: Given a new protein, can we place it in its “correct” position within an existing protein hierarchy? Methods: BLAST / PSI-BLAST, profile HMMs, supervised machine learning methods. [Diagram: a new protein being placed into the Fold > Superfamily > Family > Proteins hierarchy]

10 MACHINE LEARNING CONCEPTS: supervised methods; discriminative vs. generative models; transductive learning; support vector machines; kernel methods; semi-supervised methods

11 DISCRIMINATIVE AND GENERATIVE MODELS [Figure comparing discriminative and generative models]

12 TRANSDUCTIVE LEARNING: Most learning is inductive: given (x_1, y_1), …, (x_m, y_m), predict the label y* for any test input x*. Transductive learning: given (x_1, y_1), …, (x_m, y_m) and all the test inputs {x_1*, …, x_p*}, predict the labels {y_1*, …, y_p*}.

13 SUPPORT VECTOR MACHINES: A popular discriminative learning algorithm; an optimal geometric-margin classifier that can be trained efficiently using the Sequential Minimal Optimization (SMO) algorithm. Given training examples x_1, …, x_n, sign(Σ_i α_i x_i^T x) “decides” where x falls; the α_i are trained to achieve the best margin.

14 SUPPORT VECTOR MACHINES (2): Kernelizable: the SVM solution can be written entirely in terms of dot products of the inputs, so sign(Σ_i α_i K(x_i, x)) determines the class of x.
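The kernelized decision rule above fits in a few lines. A minimal sketch, assuming the α_i and support vectors are already trained; the names and the toy 1-D numbers below are hypothetical, chosen only to illustrate the rule:

```python
def svm_decide(x, support, alphas, kernel):
    """Kernelized SVM decision rule: sign(sum_i alpha_i * K(x_i, x)).

    support: training inputs x_i; alphas: signed weights alpha_i
    (labels folded in, as on the slide); kernel: callable K(x_i, x).
    """
    score = sum(a * kernel(xi, x) for a, xi in zip(alphas, support))
    return 1 if score >= 0 else -1

# Toy 1-D example with a linear kernel K(u, v) = u*v and made-up weights:
linear = lambda u, v: u * v
support = [-2.0, 3.0]   # hypothetical support vectors
alphas = [-0.5, 0.5]    # hypothetical signed weights
print(svm_decide(4.0, support, alphas, linear))    # → 1
print(svm_decide(-4.0, support, alphas, linear))   # → -1
```

Swapping `linear` for any other kernel changes the classifier without touching the decision rule, which is the point of kernelization.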

15 KERNEL METHODS: K(x, z) = f(x)^T f(z), where f is the feature mapping and x and z are input vectors. The high-dimensional features never need to be calculated explicitly; think of the kernel function as a similarity measure between x and z.
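As a concrete check of the identity K(x, z) = f(x)^T f(z), here is a small sketch using the homogeneous quadratic kernel K(x, z) = (x^T z)^2, whose feature map f lists all pairwise products x_i·x_j (the function names are illustrative):

```python
from itertools import product

def phi(x):
    """Explicit feature map for K(x, z) = (x^T z)^2:
    all pairwise products x_i * x_j (d^2 features for d inputs)."""
    return [a * b for a, b in product(x, x)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_kernel(x, z):
    """Same value as dot(phi(x), phi(z)), but computed in O(d)
    without ever building the d^2-dimensional feature vectors."""
    return dot(x, z) ** 2

x, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
print(dot(phi(x), phi(z)))   # → 1024.0
print(quad_kernel(x, z))     # → 1024.0
```

For d-dimensional inputs the explicit map costs O(d²) per vector while the kernel evaluation costs O(d), which is why the features "do not need to be explicitly calculated."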

16 MISMATCH KERNEL: Regions of similar amino acid sequence yield a similar tertiary structure in proteins. Used as a kernel for an SVM to identify protein homologies.

17 K-MER BASED SVMS: For a given word size k and mismatch tolerance l, define K(X, Y) = the number of distinct k-long word occurrences with ≤ l mismatches, and the normalized mismatch kernel K’(X, Y) = K(X, Y) / sqrt(K(X, X)·K(Y, Y)). An SVM can then be learned by supplying this kernel function. Example: X = ABACARDI, Y = ABRADABI; with k = 3 and l = 1, K(X, Y) = 4 and K’(X, Y) = 4/sqrt(7·7) = 4/7.
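The slide's worked example can be reproduced with a short sketch. This is a simplified pair-counting reading of the slide's definition (the full Leslie et al. kernel is a dot product of mismatch-neighborhood counts), but it matches the ABACARDI/ABRADABI numbers:

```python
from math import sqrt

def kmer_set(s, k):
    """Distinct k-long words occurring in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

def hamming(a, b):
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def mismatch_kernel(x, y, k=3, l=1):
    """Count distinct unordered k-mer pairs, one word from each
    sequence, that differ in at most l positions."""
    pairs = set()
    for wx in kmer_set(x, k):
        for wy in kmer_set(y, k):
            if hamming(wx, wy) <= l:
                pairs.add(frozenset((wx, wy)))
    return len(pairs)

def normalized_kernel(x, y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) * K(Y, Y))."""
    return mismatch_kernel(x, y, k, l) / sqrt(
        mismatch_kernel(x, x, k, l) * mismatch_kernel(y, y, k, l))

X, Y = "ABACARDI", "ABRADABI"
print(mismatch_kernel(X, Y))     # → 4, as on the slide
print(normalized_kernel(X, Y))   # → 4/7 ≈ 0.5714
```

The four matching pairs are ABA–ABR, ABA–ABI, ABA–ADA, and ACA–ADA, and each sequence matches itself on 7 pairs (its 6 exact k-mers plus one within-sequence near-match), giving the 4/7 on the slide.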

18 DISADVANTAGES: Determining the 3D structure of proteins is practically impossible at scale, while primary sequences are cheap to determine. How do we use all this unlabeled data? Use semi-supervised learning based on the cluster assumption.

19 SEMI-SUPERVISED METHODS: Some examples are labeled; assume labels vary smoothly among all examples.

20 SEMI-SUPERVISED METHODS: Some examples are labeled; assume labels vary smoothly among all examples. SVMs and other discriminative methods may make significant mistakes due to lack of data.


24 SEMI-SUPERVISED METHODS: Some examples are labeled; assume labels vary smoothly among all examples. Attempt to “contract” the distances within each cluster while keeping inter-cluster distances large.


26 CLUSTER KERNELS: Semi-supervised methods. 1. Neighborhood kernel: (a) for each X, run PSI-BLAST to get similar sequences, Nbd(X); (b) define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φ_original(X') — “counts of all k-mers matching, with at most 1 difference, all sequences that are similar to X”; (c) K_nbd(X, Y) = (1/(|Nbd(X)|·|Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y'). 2. Next: bagged mismatch.
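Step (c) above averages the base kernel over the two neighborhoods, and that averaging is easy to sketch. A minimal illustration, assuming the neighborhoods are supplied by some callable (in the slides they come from PSI-BLAST; here the toy `nbd` and numeric base kernel are made up):

```python
def neighborhood_kernel(x, y, nbd, base_kernel):
    """K_nbd(X, Y) = (1 / (|Nbd(X)| * |Nbd(Y)|)) *
    sum over X' in Nbd(X), Y' in Nbd(Y) of K(X', Y')."""
    nx, ny = nbd(x), nbd(y)
    total = sum(base_kernel(xp, yp) for xp in nx for yp in ny)
    return total / (len(nx) * len(ny))

# Toy example: each point's neighborhood is itself plus one neighbor,
# and the base kernel is a plain product of numbers.
toy_nbd = lambda v: [v, v + 1]
toy_base = lambda a, b: a * b
print(neighborhood_kernel(1, 2, toy_nbd, toy_base))   # → 3.75
```

The averaging smooths the kernel: two sequences count as similar if their neighborhoods are similar, even when the sequences themselves are not, which is how the unlabeled PSI-BLAST hits enter the classifier.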

27 BAGGED MISMATCH KERNEL: Final method. 1. Run k-means clustering n times, giving assignments c_p(X) for p = 1, …, n. 2. For every X and Y, count the fraction of times they are bagged together: K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y)). 3. Combine the “bag fraction” with the original comparison K(·,·): K_new(X, Y) = K_bag(X, Y) · K(X, Y).
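Steps 2 and 3 above can be sketched directly once the n clustering runs have produced their assignments. The assignments and the constant base kernel below are hypothetical stand-ins for the k-means output and the mismatch kernel:

```python
def bagged_kernel(assignments, base_kernel):
    """Build K_bag and K_new = K_bag * K from n cluster assignments.

    assignments: list of n dicts mapping point -> cluster id,
    one dict per clustering run (e.g. per k-means restart).
    """
    n = len(assignments)

    def k_bag(x, y):
        # fraction of runs that put x and y in the same cluster
        return sum(c[x] == c[y] for c in assignments) / n

    def k_new(x, y):
        return k_bag(x, y) * base_kernel(x, y)

    return k_bag, k_new

# Hypothetical assignments from n = 4 clustering runs:
runs = [{"a": 0, "b": 0}, {"a": 0, "b": 1},
        {"a": 1, "b": 1}, {"a": 0, "b": 0}]
k_bag, k_new = bagged_kernel(runs, lambda x, y: 2.0)
print(k_bag("a", "b"))   # → 0.75 (bagged together in 3 of 4 runs)
print(k_new("a", "b"))   # → 1.5  (0.75 * 2.0)
```

K_bag down-weights pairs that the unsupervised clustering keeps separating, so the combined kernel "contracts" within-cluster distances as described on slide 24.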

28 [Figure — O. Jangmin]

29 WHAT WORKS BEST? The transductive setting.

30 REFERENCES: C. Leslie et al. Mismatch string kernels for discriminative protein classification. Bioinformatics Advance Access, January 22, 2004. J. Weston et al. Semi-supervised protein classification using cluster kernels, 2003. Images from Wikimedia Commons.

