Presentation is loading. Please wait.

Presentation is loading. Please wait.

Faculty of Computer Science © 2006 CMPUT 605February 04, 2008 Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc.

Similar presentations


Presentation on theme: "Faculty of Computer Science © 2006 CMPUT 605February 04, 2008 Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc."— Presentation transcript:

1 Faculty of Computer Science © 2006 CMPUT 605February 04, 2008 Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc E, Cherkasov A., and Sahinalp S.C. Amit Satsangi amit@cs.ualberta.ca

2 © 2006 Department of Computing Science CMPUT 605 Background and Focus  Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin)  Structural similarity  Similar biological and/or physico-chemical properties (Maggiora et al.)  Classification of probe compound (unknown bioactivity)  Similarity search amongst compounds with known bioactivity

3 © 2006 Department of Computing Science CMPUT 605 Background and Focus  Determining similarity distance measures (SDM)  Using SDM for classification of compounds—k-NN classification  Efficient data structures for fast similarity search— DMVP trees (an improvement over SCVP trees used previously)

4 © 2006 Department of Computing Science CMPUT 605 Outline  Similarity measures  Classification techniques  k-NN classifier  DMVP tree  Results, Observations and Conclusion

5 © 2006 Department of Computing Science CMPUT 605 Similarity between Molecules  Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines)  Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools) * Quantitative Structure-Activity Relationship

6 © 2006 Department of Computing Science CMPUT 605 Similarity Measures  Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y:  X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases)  Range of Tanimoto coefficient: [0, 1]

7 © 2006 Department of Computing Science CMPUT 605 Similarity measures  Tanimoto Dist. Measure: D T (X,Y) = 1 –T(X,Y)  Minkowski distance (L P ):  Real valued data possible

8 © 2006 Department of Computing Science CMPUT 605 Classification Techniques  Multiple Linear Regression (MLR)  Linear Discriminant Analysis (LDA)  Artifical Neural Networks (ANN)  Support Vector Machines (SVM)  k-nearest Neighbor (k-NN) classification not used previously.

9 © 2006 Department of Computing Science CMPUT 605 Distance-based Classification  Compounds—s & r  S & R respective descriptor arrays  If D(S,R) is small then bioactivity levels of s & r are similar  Notion of distance  classification of new compounds  Distance measure == metric (conditions) e.g. Hamming Distance, Tanimoto distance etc.

10 © 2006 Department of Computing Science CMPUT 605 k-nn Classification  Given  Bioactivity  To Find  Distance measure that separates active and inactive compounds for the training set N- dimensional plane  Problem  Easy

11 © 2006 Department of Computing Science CMPUT 605 k-nn Classification GGiven  Bioactivity TTo Find  Distance measure that separates active and inactive compounds for the training set N- dimensional plane PProblem  NP-hard SSolution  Use Genetic Algorithms, heuristic linear search to find the plane

12 © 2006 Department of Computing Science CMPUT 605 QSAR approach Uses a linear combination of descriptors Assigns a weight to each dimension, W [0,1] Weighted Minkowski distance of order 1 Only binary classification considered (A/I) Methods are general

13 © 2006 Department of Computing Science CMPUT 605 Parameter Optimization

14 © 2006 Department of Computing Science CMPUT 605 k-NN Classifier  Set of data elements: {X1, … Xn}  Query element: Y  Range query  Find Xi such that D(Y,Xi) < R1 (user defined)  k-nn query  Find k items such that their distance to Y is as small as possible

15 © 2006 Department of Computing Science CMPUT 605 Data structures: VP-Trees  Vantage Point (VP) tree  Choose an arbitrary data point (called Vantage Point)  Binary tree—recursively partitions the dataset into two equal sized subsets  Zero in on the nearest neighbor

16 © 2006 Department of Computing Science CMPUT 605 Efficient data structures: SCVP Trees  Space Covering Vantage Point tree  Multiple vantage points chosen at each level  No more a binary tree—multiple branches at each internal node  Multiple inner partitions—hope is that each data point lies in atleast one inner partition

17 © 2006 Department of Computing Science CMPUT 605 DMVP Tree  Memory requirements of SCVP tree can be large—redundancy of data elements  Deterministic selection of Vantage points  VP minimization—NP-Hard  Minimization == Weighted set cover problem  Use of greedy Algorithm: O(log l); l<n  Approximates the min number of VP’s

18 © 2006 Department of Computing Science CMPUT 605 Experiments  Five types of bioactivities viz. being antibiotic (520), bacterial metabolite (562), human metabolite(1104), drug(958), drug- like(1202)  62 dimensional descriptor array (30 QSAR & 32 physico- chemical properties)  k=1 i.e. one NN  Comparison with LDA, MLR, ANN  70% data used for training  wL 1 distance is calculated in all cases

19 © 2006 Department of Computing Science CMPUT 605 Experimental Results  Table 1 shows that in almost all cases in terms of accuracy, and T_P, T_N, F_P etc. k-NN does better than LDA and MLR  ANN beats k-NN on almost all counts  Pruning—more than 80% in each kind of bioactivity (over brute-force search)  Key point – k-NN classifier is faster  More than 100 times faster than ANN

20 © 2006 Department of Computing Science CMPUT 605 Experimental Results  Can calculate the level of bioactivity instead of a YES/NO  The value of the weights provides insights into the importance of descriptors for each bioactivity

21 © 2006 Department of Computing Science CMPUT 605 Observations & Conclusion  Bacterial metabolites & antimicrobial drugs overlap (confirmation)  Human metabolites display distinctive properties  QSAR models for drugs + human metabolites dominated by few descriptors  These descriptors favored by drug developers and natural evolution

22 © 2006 Department of Computing Science CMPUT 605 Observations & Conclusion  Classification results from k-NN can help rationalize the design and discovery of drugs  DMVP tree improves the space utilization of the program  Provides a means for fast similarity search  Data structure can be applied to any metric distance like wL p and Tanimoto distance

23 © 2006 Department of Computing Science CMPUT 605 Thank You For Your Attention!


Download ppt "Faculty of Computer Science © 2006 CMPUT 605February 04, 2008 Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc."

Similar presentations


Ads by Google