
1 Protein Classification

2 Given a new protein, can we place it in its "correct" position within an existing protein hierarchy? Methods: BLAST / PSI-BLAST; profile HMMs; supervised machine learning methods. (Figure: protein hierarchy of fold, superfamily, and family, with a "?" marking where the new protein belongs.)

3 PSI-BLAST. Given a sequence query x and a database D:
1. Find all pairwise alignments of x to sequences in D
2. Collect all matches of x to y with some minimum significance
3. Construct a position-specific matrix M; each sequence y is given a weight so that many similar sequences cannot have too much influence on a position (Henikoff & Henikoff 1994)
4. Using the matrix M, search D for more matches
5. Iterate 1-4 until convergence
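A minimal sketch of this outer loop in Python; blast_search and build_pssm are hypothetical helpers standing in for BLAST's alignment engine and PSSM construction with Henikoff position-based weights:

```python
def psi_blast(x, database, max_iters=10, evalue_cutoff=1e-3):
    """Sketch of the PSI-BLAST outer loop (helpers are hypothetical)."""
    profile = None          # first round scores with a plain substitution matrix
    hits = set()
    for _ in range(max_iters):
        # 1-2. align query (or current profile) to the database, keep significant matches
        alignments = blast_search(x, database, profile=profile)
        new_hits = {y for y, evalue in alignments if evalue <= evalue_cutoff}
        if new_hits <= hits:            # 5. stop when no new sequences are found
            break
        hits |= new_hits
        # 3. build position-specific matrix M; sequence weighting downplays
        #    groups of near-identical hits (Henikoff & Henikoff 1994)
        profile = build_pssm(x, hits)
        # 4. the next round searches the database with M instead of x
    return profile, hits
```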

4 Classification with Profile HMMs. (Figure: protein hierarchy of fold, superfamily, and family, with a "?" marking where the new protein belongs.)

5 The Fisher Kernel
Fisher score: U_X = ∇_θ log P(X | H1, θ)
- Quantifies how each parameter contributes to generating X
- For two different sequences X and Y, we can compare U_X and U_Y: D²_F(X, Y) = |U_X − U_Y|² / (2σ²)
Given this distance function, K(X, Y) is defined as a similarity measure: K(X, Y) = exp(−D²_F(X, Y))
- Set σ so that the average distance of training sequences X_i ∈ H1 to sequences X_j ∈ H0 is 1

6 The Fisher Kernel
To train a classifier for a given family H1:
1. Build a profile HMM, H1
2. U_X = ∇_θ log P(X | H1, θ) (Fisher score)
3. D²_F(X, Y) = |U_X − U_Y|² / (2σ²) (distance)
4. K(X, Y) = exp(−D²_F(X, Y)) (akin to a dot product)
5. L(X) = Σ_{Xi ∈ H1} λ_i K(X, X_i) − Σ_{Xj ∈ H0} λ_j K(X, X_j)
6. Iteratively adjust λ to optimize J(λ) = Σ_{Xi ∈ H1} λ_i (2 − L(X_i)) − Σ_{Xj ∈ H0} λ_j (2 + L(X_j))
To classify a query X:
- Compute U_X
- Compute K(X, X_i) for all training examples X_i with λ_i ≠ 0 (few)
- Decide based on whether L(X) > 0
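A minimal numpy sketch of steps 2-5, assuming a caller-supplied log_likelihood(theta, X) for the profile HMM H1; real implementations obtain the Fisher score from forward-algorithm sufficient statistics rather than finite differences:

```python
import numpy as np

def fisher_score(log_likelihood, theta, X, eps=1e-5):
    """U_X = gradient of log P(X | H1, theta) w.r.t. theta (finite-difference sketch)."""
    U = np.zeros_like(theta)
    for k in range(len(theta)):
        d = np.zeros_like(theta)
        d[k] = eps
        U[k] = (log_likelihood(theta + d, X) - log_likelihood(theta - d, X)) / (2 * eps)
    return U

def fisher_kernel(U_X, U_Y, sigma=1.0):
    """K(X, Y) = exp(-|U_X - U_Y|^2 / (2 sigma^2))."""
    d2 = np.sum((U_X - U_Y) ** 2) / (2 * sigma ** 2)
    return np.exp(-d2)

def discriminant(U_X, train_scores, labels, lam, sigma=1.0):
    """L(X): positively-weighted kernel sum over H1 examples minus the H0 sum."""
    return sum((1 if y else -1) * l * fisher_kernel(U_X, U_i, sigma)
               for U_i, y, l in zip(train_scores, labels, lam))
```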

7 O. Jangmin

8 (figure-only slide)

9 QUESTION: What is the running time of the Fisher kernel SVM on a query X?

10 k-mer based SVMs. Leslie, Eskin, Weston, Noble; NIPS 2002
Highlights:
- K(X, Y) = exp(−|U_X − U_Y|² / (2σ²)) requires an expensive profile alignment: U_X = ∇_θ log P(X | H1, θ) costs O(|X| |H1|)
- Instead, the new kernel K(X, Y) just "counts up" k-mers with mismatches in common between X and Y: O(|X|) in practice
- Off-the-shelf SVM software can be used

11 k-mer based SVMs. For a given word size k and mismatch tolerance l, define K(X, Y) = # distinct k-long word occurrences with ≤ l mismatches in common between X and Y. Define the normalized kernel K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y)). An SVM can be learned by supplying this kernel function.
Example (k = 3, l = 1): X = ABACARDI, Y = ABRADABI gives K(X, Y) = 4 and K'(X, Y) = 4 / sqrt(7 · 7) = 4/7.
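A brute-force sketch under one plausible reading of the definition (count pairs of k-mer occurrences within ≤ l mismatches); the slide's exact counting convention may differ, so the numbers above are from the original figure:

```python
import math

def hamming(a, b):
    """Number of mismatching positions between two equal-length words."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def mismatch_kernel(X, Y, k=3, l=1):
    """Count k-mer occurrence pairs from X and Y within <= l mismatches.
    A simplified counting convention; the slide's figure may count differently."""
    xmers = [X[i:i + k] for i in range(len(X) - k + 1)]
    ymers = [Y[i:i + k] for i in range(len(Y) - k + 1)]
    return sum(1 for a in xmers for b in ymers if hamming(a, b) <= l)

def normalized_kernel(X, Y, k=3, l=1):
    """K'(X, Y) = K(X, Y) / sqrt(K(X, X) K(Y, Y))."""
    return mismatch_kernel(X, Y, k, l) / math.sqrt(
        mismatch_kernel(X, X, k, l) * mismatch_kernel(Y, Y, k, l))

# value depends on the counting convention chosen above
print(normalized_kernel("ABACARDI", "ABRADABI"))
```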

12 SVMs will find a few support vectors. After training, the SVM has determined a small set of sequences, the support vectors, which need to be compared with the query sequence X.

13 Benchmarks

14 Semi-Supervised Methods GENERATIVE SUPERVISED METHODS

15 Semi-Supervised Methods DISCRIMINATIVE SUPERVISED METHODS

16 Semi-Supervised Methods. UNSUPERVISED METHODS. Mixture of centers: data generated by a fixed set of centers (how many?)

26 Semi-Supervised Methods. Some examples are labeled; assume labels vary smoothly among all examples.

27 Semi-Supervised Methods. Some examples are labeled; assume labels vary smoothly among all examples. SVMs and other discriminative methods may make significant mistakes due to lack of data.

31 Semi-Supervised Methods. Some examples are labeled; assume labels vary smoothly among all examples. Attempt to "contract" the distances within each cluster while keeping inter-cluster distances larger.

33 Semi-Supervised Methods
1. Kuang, Ie, Wang, Siddiqi, Freund, Leslie 2005: a PSI-BLAST profile-based method
2. Weston, Leslie, Elisseeff, Noble, NIPS 2003: cluster kernels

34 (semi)1. Profile k-mer based SVMs
For each sequence X:
- Obtain the PSI-BLAST profile Q(X) = {p_i(β); β an amino acid, 1 ≤ i ≤ |X|}
- For every k-mer x_j…x_{j+k−1} in X, define the σ-neighborhood M_{k,σ}(Q[x_j…x_{j+k−1}]) = {b_1…b_k | −Σ_{i=0…k−1} log p_{j+i}(b_i) < σ}
- Define K(X, Y): for each word b_1…b_k occurring m times in X's neighborhoods and n times in Y's, add m·n
In practice, each k-mer can have ≤ 2 mismatches, and K(X, Y) can be computed quickly, in O(k² 20² (|X| + |Y|)).
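A brute-force sketch of the σ-neighborhood and the resulting kernel, assuming Q(X) is given as a list of per-position probability dicts; enumerating all 20^k candidate words is only feasible for small k, whereas the paper's algorithm bounds mismatches to reach the quoted running time:

```python
import math
from collections import Counter
from itertools import product

AAS = "ACDEFGHIKLMNPQRSTVWY"   # the 20-letter amino-acid alphabet

def neighborhood(profile_kmer, sigma):
    """All words b_1..b_k with -sum_i log p_i(b_i) < sigma.
    profile_kmer: list of k dicts mapping residue -> probability."""
    k = len(profile_kmer)
    words = []
    for word in product(AAS, repeat=k):        # 20^k candidates: small k only
        score = -sum(math.log(pos.get(b, 1e-9)) for pos, b in zip(profile_kmer, word))
        if score < sigma:
            words.append("".join(word))
    return words

def profile_kernel(Q_X, Q_Y, k, sigma):
    """Each word occurring m times across X's k-mer neighborhoods and n times
    across Y's contributes m*n (brute force; the paper's version is faster)."""
    cx, cy = Counter(), Counter()
    for j in range(len(Q_X) - k + 1):
        cx.update(neighborhood(Q_X[j:j + k], sigma))
    for j in range(len(Q_Y) - k + 1):
        cy.update(neighborhood(Q_Y[j:j + k], sigma))
    return sum(m * cy[w] for w, m in cx.items())
```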

35 (semi)1. Discriminative motifs
Under this kernel K(X, Y), sequence X is mapped to Φ_{k,σ}(X), a vector in 20^k dimensions:
- Φ_{k,σ}(X)(b_1…b_k) = # of k-mers in Q(X) whose neighborhood includes b_1…b_k
Then the SVM learns a discriminating "hyperplane" with normal vector v = Σ_{i=1…N} (±) λ_i Φ_{k,σ}(X^(i))
- Consider a profile k-mer Q[x_j…x_{j+k−1}]; its contribution to v is ~ ⟨Φ_{k,σ}(Q[x_j…x_{j+k−1}]), v⟩
- Consider a position i in X and count up the contributions of all words containing x_i: g(x_i) = Σ_{j=1…k} max{0, ⟨Φ_{k,σ}(Q[x_{i−k+j}…x_{i−1+j}]), v⟩}
- Sort these contributions across all positions of all sequences to pick important positions, the discriminative motifs
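A small sketch of the positional scoring, assuming the per-k-mer feature vectors Φ_{k,σ}(Q[x_j…x_{j+k−1}]) and the SVM normal v have already been computed:

```python
import numpy as np

def position_scores(phi_kmers, v, k, seq_len):
    """g(x_i): sum, over the k words covering position i, of max(0, <Phi, v>).
    phi_kmers[j] is the (assumed precomputed) feature vector of the k-mer
    starting at position j."""
    contrib = [max(0.0, float(np.dot(phi, v))) for phi in phi_kmers]
    g = np.zeros(seq_len)
    for j, c in enumerate(contrib):
        g[j:j + k] += c        # the word starting at j covers positions j..j+k-1
    return g                   # sort positions by g to extract candidate motifs
```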

36 (semi)1. Discriminative motifs. Consider a position i in X: count up the contributions to v of all words containing x_i, then sort these contributions across all positions of all sequences to pick discriminative motifs.

38 (semi)2. Cluster Kernels
Two (more!) methods:
1. Neighborhood
   1. For each X, run PSI-BLAST to get similar sequences: Nbd(X)
   2. Define Φ_nbd(X) = (1/|Nbd(X)|) Σ_{X' ∈ Nbd(X)} Φ_original(X'): "counts of all k-mers matching, with at most 1 difference, all sequences that are similar to X"
   3. K_nbd(X, Y) = (1/(|Nbd(X)| · |Nbd(Y)|)) Σ_{X' ∈ Nbd(X)} Σ_{Y' ∈ Nbd(Y)} K(X', Y')
2. Bagged mismatch
   1. Run k-means clustering n times, giving assignments c_p(X) for p = 1, …, n
   2. For every X and Y, count up the fraction of times they are bagged together: K_bag(X, Y) = (1/n) Σ_p 1(c_p(X) = c_p(Y))
   3. Combine the "bag fraction" with the original comparison K(·,·): K_new(X, Y) = K_bag(X, Y) · K(X, Y)
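A sketch of both kernels, assuming a precomputed base kernel matrix K, PSI-BLAST neighbor lists, and per-sequence feature vectors (e.g., k-mer counts) for k-means; scikit-learn's KMeans stands in for whichever clustering the authors used:

```python
import numpy as np
from sklearn.cluster import KMeans

def neighborhood_kernel(K, nbrs):
    """K_nbd(X, Y): average of K(X', Y') over X' in Nbd(X), Y' in Nbd(Y).
    K: m x m base kernel; nbrs[i]: indices PSI-BLAST deems similar to i."""
    m = K.shape[0]
    K_nbd = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            K_nbd[i, j] = K[np.ix_(nbrs[i], nbrs[j])].mean()
    return K_nbd

def bagged_kernel(features, K, n_runs=10, n_clusters=5, seed=0):
    """K_bag(X, Y) = fraction of k-means runs placing X and Y together;
    K_new = K_bag * K (elementwise). `features` is an (m, d) array of
    per-sequence vectors, an assumption here."""
    m = len(features)
    together = np.zeros((m, m))
    for p in range(n_runs):
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed + p).fit_predict(features)
        together += labels[:, None] == labels[None, :]
    return (together / n_runs) * K
```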

39 Some Benchmarks

40 Google-like homology search
The internet and the network of protein homologies have some similarity: both are scale-free networks. Given a query X, Google ranks webpages by a flow algorithm:
- From each webpage W, linked neighbors receive flow
- At time t+1, W sends to its neighbors the flow it received at time t
- This is a finite, ergodic, aperiodic Markov chain, so the stationary distribution can be found efficiently as the left eigenvector with eigenvalue 1: start with an arbitrary probability distribution and repeatedly multiply by the transition matrix
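A minimal power-iteration sketch for a row-stochastic transition matrix P; the stationary distribution y satisfies y P = y:

```python
import numpy as np

def stationary(P, tol=1e-10, max_iter=10_000):
    """Power iteration: start from any distribution and repeatedly apply the
    row-stochastic transition matrix P until the distribution stops changing."""
    y = np.full(P.shape[0], 1.0 / P.shape[0])   # arbitrary starting distribution
    for _ in range(max_iter):
        y_next = y @ P                          # left multiplication: y P
        if np.abs(y_next - y).sum() < tol:
            break
        y = y_next
    return y_next
```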

41 Google-like homology search. Weston, Elisseeff, Zhu, Leslie, Noble, PNAS 2004
RANKPROP algorithm for protein homology. First, compute a matrix K_ij of PSI-BLAST homology between proteins i and j, normalized so that Σ_j K_ji = 1.
1. Initialization: y_1(0) = 1; y_i(0) = 0 for i ≠ 1
2. For t = 0, 1, …
3.   For i = 2 to m
4.     y_i(t+1) = K_1i + α Σ_j K_ji y_j(t)
In the end, let y_i be the ranking score for the similarity of sequence i to sequence 1 (α = 0.95 works well).
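A numpy sketch of the iteration, with sequence 0 as the query and K normalized so that Σ_j K_ji = 1:

```python
import numpy as np

def rankprop(K, alpha=0.95, n_iter=100):
    """RANKPROP sketch: K is the m x m PSI-BLAST similarity matrix, normalized
    so each column sums to 1 (sum_j K[j, i] = 1); sequence 0 is the query."""
    m = K.shape[0]
    y = np.zeros(m)
    y[0] = 1.0
    for _ in range(n_iter):
        # y_i <- K_{1i} + alpha * sum_j K_{ji} y_j, for all i except the query
        y_new = K[0, :] + alpha * (y @ K)
        y_new[0] = 1.0                  # keep the query's activation fixed
        y = y_new
    return y                            # y_i ranks sequence i against the query
```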

42 Google-like homology search For a given protein family, what fraction of true members of the family are ranked higher than the first 50 non-members?

43 Protein Structure Prediction

44 Protein Structure Determination
Experimental: X-ray crystallography; NMR spectroscopy
Computational: structure prediction (the Holy Grail). Sequence implies structure, so in principle we can predict the structure from the sequence alone.

45 Protein Structure Prediction
- Ab initio: use just first principles (energy, geometry, and kinematics)
- Homology: find the best match to a database of sequences with known 3D structure
- Threading
- Meta-servers and other methods

46 Ab initio Prediction
Sampling the global conformation space
- Lattice models / discrete-state models
- Molecular dynamics
Picking native conformations with an energy function
- Solvation model: how the protein interacts with water
- Pair interactions between amino acids
Predicting secondary structure
- Local homology
- Fragment libraries

47 Lattice String Folding
The HP model: the main modeled force is hydrophobic attraction
- NP-hard in both the 2-D square and the 3-D cubic lattice
- Constant-factor approximation algorithms exist
- Not so relevant biologically
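The HP energy itself is easy to state in code: a sketch that scores a candidate fold as −1 per non-bonded H-H contact on the 2-D square lattice (the hard part, which the slide notes is NP-hard, is the search over self-avoiding walks):

```python
def hp_energy(sequence, walk):
    """Energy of an HP chain on the 2-D square lattice: -1 per non-bonded
    H-H contact. `walk` is a self-avoiding list of (x, y) lattice points,
    one per residue of `sequence` (a string over {'H', 'P'})."""
    coords = {pos: i for i, pos in enumerate(walk)}
    assert len(coords) == len(walk), "walk must be self-avoiding"
    energy = 0
    for i, (x, y) in enumerate(walk):
        if sequence[i] != "H":
            continue
        for nbr in ((x + 1, y), (x, y + 1)):        # count each contact once
            j = coords.get(nbr)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                energy -= 1
    return energy

# e.g. hp_energy("HPPH", [(0, 0), (1, 0), (1, 1), (0, 1)]) == -1
```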

48 Lattice String Folding

