4. Molecular Similarity


1 4. Molecular Similarity

2 Similarity and Searching
Historical Progression
Similarity Measures
Fingerprint Construction
“Pathological” Cases
MinMax - Counts
Pruning Search Space
Aggregate Queries
LSH

3 Historical Progression
Maximum Common Subgraph Isomorphism (MCS)
– maximum common substructure between two molecules
– “NP-complete”
Structural Keys
– dictionary of predetermined, domain-specific substructures keyed to particular positions in a bit vector constructed for each molecule
– similarity computed between bit vectors (fast O(D) scan)
2D Compressed Fingerprints
– ALL substructures stored in a bit vector using a hashing scheme plus lossy compression (modulo operator)
– similarity computed between bit vectors or count vectors
Faster Searches
– database pruning
– locality-sensitive hashing (LSH): towards O(log n) similarity searching

4 Superstructure and Substructure Searches
A is a superstructure of B (ignoring H)
B is a substructure of A
Tversky similarity
[Figure: molecules A and B]

5 The Similarity Problem
How similar?

6 Spectral Similarity
1. Count substructures
2. Compare the count/bit vectors

7 2D Graph Substructures
For chemical compounds
– atom/node labels: A = {C, N, O, H, …}
– bond/edge labels: B = {s, d, t, ar, …}
Trace ALL paths: O(N·d^l)
Cycles and trees
Combinatorial space (e.g. the path CsNsCdO)
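The path tracing described on this slide can be sketched as a depth-first walk over a labeled graph. This is a minimal illustration, not the presenter's implementation: the function name `enumerate_paths`, the input layout (a list of atom labels plus a dictionary of bond labels), and the string encoding of a path are all assumptions. Reading N as the number of atoms, d as the branching degree, and l as the maximum path length is also an interpretation of the O(N·d^l) cost on the slide.

```python
def enumerate_paths(atoms, bonds, max_bonds):
    """Return every labeled simple path with at most max_bonds bonds.

    atoms: list of atom labels, e.g. ["C", "N", "C", "O"]
    bonds: dict mapping (i, j) index pairs to a bond label ("s", "d", ...)
    Both input layouts are hypothetical, chosen only for this sketch.
    """
    # Build an undirected adjacency list from the bond dictionary.
    adjacency = {i: [] for i in range(len(atoms))}
    for (i, j), label in bonds.items():
        adjacency[i].append((j, label))
        adjacency[j].append((i, label))

    paths = set()

    def walk(node, visited, text, depth):
        paths.add(text)
        if depth == max_bonds:
            return
        for neighbor, label in adjacency[node]:
            if neighbor not in visited:  # simple paths only, no revisits
                walk(neighbor, visited | {neighbor},
                     text + label + atoms[neighbor], depth + 1)

    # Start one walk from every atom: roughly N * d^l steps in total.
    for start in range(len(atoms)):
        walk(start, {start}, atoms[start], 0)
    return paths


# The C-N-C=O chain behind the slide's example path.
example = enumerate_paths(["C", "N", "C", "O"],
                          {(0, 1): "s", (1, 2): "s", (2, 3): "d"}, 3)
print(sorted(example))
```

Each path is emitted in both directions here (for example CsNsCdO and OdCsNsC). A real fingerprinting scheme would probably canonicalize the two readings into one feature, which this sketch leaves out.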

8 Mapping Structures to Bits
Compact data representation
Hash each path to a position in the bit vector
Feature space → bit space
Resolve clashes with the OR operator (i.e. 1 + 1 = 1)
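A minimal sketch of the hashing step follows. The slides do not say which hash function is used, so the choice of MD5 from `hashlib`, the function name `fold_to_bits`, and the default length of 64 bits are assumptions made only to keep the example deterministic and small. The modulo operation is the lossy compression step, and OR is what makes two colliding features set the same bit once.

```python
import hashlib


def fold_to_bits(features, n_bits=64):
    """Hash each feature string to one position of an n_bits bit vector."""
    bits = [0] * n_bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        # Take 8 bytes of the digest as an integer, then fold with modulo.
        position = int.from_bytes(digest[:8], "big") % n_bits
        # OR resolves clashes: a bit that is already on stays on (1 + 1 = 1).
        bits[position] |= 1
    return bits


fingerprint = fold_to_bits(["CsN", "NsC", "CdO", "CsNsCdO"], n_bits=16)
print(fingerprint)
```

A shorter vector saves space but causes more collisions, so distinct substructures are more likely to share a bit.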

9 Similarity Measures
There are many ways of measuring similarity (or distance) between bit/count vectors:
– Euclidean
– Cosine
– Exponentials
– Tanimoto/Jaccard
– Tversky
– MinMax
– and many more (L1, L2, Lp, Hamming, Manhattan, …)
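A few of the listed measures can be written directly from their textbook definitions. The sketch below covers Euclidean, Manhattan (L1), Hamming, and cosine on equal-length vectors. The function names are illustrative, and returning 0.0 for the cosine of an all-zero vector is an assumed convention, not something the slides specify.

```python
import math


def euclidean(x, y):
    """L2 distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def hamming(x, y):
    """Number of positions at which the two vectors differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def cosine(x, y):
    """Cosine similarity; 0.0 if either vector is all zeros (assumed convention)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


x, y = [1, 0, 1], [1, 1, 0]
print(euclidean(x, y), manhattan(x, y), hamming(x, y), cosine(x, y))
```

On bit vectors the Manhattan and Hamming distances coincide. They differ once counts are used.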

12 Similarity Measures: Tanimoto
Tally features:
– Unique (a, b)
– Both on (c)
– Both off (d)
Similarity formula
– Tanimoto = c/(a + b + c)
[Figure: Venn diagram of A and B with regions a, c, b]
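The tally on this slide translates almost line for line into code. In this sketch, a counts bits on only in A, b counts bits on only in B, and c counts bits on in both. The both-off count d does not enter the formula. The function name and the choice to return 1.0 for two all-zero fingerprints are assumptions.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity c / (a + b + c) of two equal-length bit vectors."""
    a = sum(1 for x, y in zip(fp_a, fp_b) if x and not y)  # on only in A
    b = sum(1 for x, y in zip(fp_a, fp_b) if y and not x)  # on only in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)      # on in both
    if a + b + c == 0:
        return 1.0  # assumed convention for two empty fingerprints
    return c / (a + b + c)


# a = 1, b = 1, c = 2, so the similarity is 2 / 4.
print(tanimoto([1, 1, 0, 1, 0], [1, 0, 1, 1, 0]))  # 0.5
```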

13 The Fingerprint Approximation
Fingerprint bit similarity approximates chemical feature similarity.

14 Similarity Measures: Tversky
Tally features:
– Unique (a, b)
– Both on (c)
– Both off (d)
Similarity formula
– Tanimoto = c/(a + b + c)
– Tversky(α, β) = c/(αa + βb + c)
[Figure: Venn diagram of A and B with regions a, c, b]
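The Tversky formula can be sketched the same way, using the same reading of a, b, and c as the Tanimoto tally. Setting α = β = 1 recovers Tanimoto. Setting α = 1, β = 0 gives c/(a + c), which reaches 1 whenever every on-bit of A is also on in B, and this is the asymmetric behavior that makes the measure useful for substructure-style searches. The function name and the handling of a zero denominator are assumptions.

```python
def tversky(fp_a, fp_b, alpha, beta):
    """Tversky similarity c / (alpha*a + beta*b + c) of two bit vectors."""
    a = sum(1 for x, y in zip(fp_a, fp_b) if x and not y)  # on only in A
    b = sum(1 for x, y in zip(fp_a, fp_b) if y and not x)  # on only in B
    c = sum(1 for x, y in zip(fp_a, fp_b) if x and y)      # on in both
    denominator = alpha * a + beta * b + c
    if denominator == 0:
        return 1.0  # assumed convention when nothing contributes
    return c / denominator


small = [1, 1, 0, 0]  # every on-bit of `small` is also on in `large`
large = [1, 1, 1, 0]
print(tversky(small, large, 1, 1))  # same as Tanimoto: 2 / 3
print(tversky(small, large, 1, 0))  # 1.0, `small` is fully contained
```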

15 Pathological Cases
Flower DR. On the Properties of Bit String-Based Measures of Chemical Similarity. J. Chem. Inf. Comput. Sci. 1998, 38, 379-386.

16 Pathological Cases
Issue of labeling scheme.

17 Counts
MinMax similarity is a generalization of Tanimoto that uses the counts.
MinMax can work better than Tanimoto.
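MinMax is usually defined as the sum of coordinate-wise minima divided by the sum of coordinate-wise maxima, which is the definition assumed in this sketch. On 0/1 vectors the minimum acts as AND and the maximum as OR, so the value reduces to Tanimoto. The function name and the value returned for two all-zero vectors are assumptions.

```python
def minmax(counts_a, counts_b):
    """MinMax similarity: sum of minima over sum of maxima."""
    numerator = sum(min(x, y) for x, y in zip(counts_a, counts_b))
    denominator = sum(max(x, y) for x, y in zip(counts_a, counts_b))
    if denominator == 0:
        return 1.0  # assumed convention for two empty count vectors
    return numerator / denominator


# Count vectors: minima sum to 2, maxima sum to 7.
print(minmax([2, 0, 1, 3], [1, 1, 1, 0]))

# Bit vectors: the result matches Tanimoto (c = 2, a + b + c = 4).
print(minmax([1, 1, 0, 1], [1, 0, 1, 1]))  # 0.5
```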

18 Pruning Search Space Using Bounds
Linear speedup (search C×D) for a fixed threshold, often by one order of magnitude or more.
Sub-linear speedup (search C×D^0.6) for top K.
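The transcript does not say which bound is used. One simple bound that works for threshold searches follows from set sizes alone: the intersection cannot exceed the smaller fingerprint and the union cannot be smaller than the larger one, so Tanimoto(A, B) ≤ min(|A|, |B|) / max(|A|, |B|). The sketch below uses that bound as an assumption, with fingerprints stored as sets of on-bit indices. The function names are illustrative.

```python
def tanimoto_sets(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def search_with_pruning(query, database, threshold):
    """Return (hits, skipped) for a fixed-threshold Tanimoto search.

    A database fingerprint is skipped without a full comparison when the
    size-based upper bound already falls below the threshold.
    """
    hits, skipped = [], 0
    q = len(query)
    for index, fingerprint in enumerate(database):
        n = len(fingerprint)
        bound = min(q, n) / max(q, n) if max(q, n) else 1.0
        if bound < threshold:
            skipped += 1  # cannot reach the threshold, so prune
            continue
        score = tanimoto_sets(query, fingerprint)
        if score >= threshold:
            hits.append((index, score))
    return hits, skipped


query = {1, 2, 3, 4}
database = [{1, 2, 3, 4}, {1, 2, 3}, {1}, set(range(1, 11)), {5, 6, 7, 8}]
print(search_with_pruning(query, database, 0.7))
```

If the database is sorted by bit count, the surviving candidates form one contiguous range, which is presumably where much of the practical speedup comes from.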

20 Speedup from Pruning
Speedup depends on:
– Threshold
– Query
– Fingerprint length
– Database size

23 Bias in Query Distribution

26 Aggregate Queries (“Profiles”)

27 Two Basic Strategies
Similar to bioinformatics
1. Aggregate individual pairwise measures
2. Build a fingerprint profile
– Linear approaches
– Non-linear approaches (consensus, modal, etc.)
Hybrid (profile + aggregation/“scaling”)
Profile-profile
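The first strategy, aggregating individual pairwise measures, can be sketched as scoring each database molecule against every molecule in the query set and then combining the scores. The slides do not name the aggregation functions, so the max and mean options below are assumptions, as are the function names and the use of sets of on-bit indices.

```python
def tanimoto_sets(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def aggregate_score(candidate, query_set, how="max"):
    """Combine the pairwise similarities between one candidate and a query set."""
    scores = [tanimoto_sets(candidate, query) for query in query_set]
    if how == "max":
        return max(scores)  # the closest query molecule decides
    if how == "mean":
        return sum(scores) / len(scores)  # average over the whole query set
    raise ValueError("how must be 'max' or 'mean'")


queries = [{1, 2, 3}, {3, 4}]
candidate = {1, 2, 3}
print(aggregate_score(candidate, queries, "max"))
print(aggregate_score(candidate, queries, "mean"))
```

The max rule rewards a candidate that resembles any single query molecule, while the mean rule favors candidates that resemble the set as a whole.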

28 Aggregations

29 Consensus Fingerprints
Create consensus fingerprint
Search database using the consensus
[Figure: fingerprints combined with & to give the consensus fingerprint]
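The "&" left over from the slide's figure suggests that the consensus is a bitwise AND of the query fingerprints, although the transcript does not state this outright. The sketch below shows that reading, plus a majority-vote ("modal") variant for comparison. Both function names and the strict-majority rule are assumptions.

```python
def consensus_and(fingerprints):
    """Keep a bit only if it is on in every fingerprint (bitwise AND)."""
    return [int(all(column)) for column in zip(*fingerprints)]


def consensus_modal(fingerprints):
    """Keep a bit if it is on in more than half of the fingerprints."""
    half = len(fingerprints) / 2
    return [int(sum(column) > half) for column in zip(*fingerprints)]


actives = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
]
print(consensus_and(actives))    # [1, 0, 0, 1]
print(consensus_modal(actives))  # [1, 1, 0, 1]
```

The resulting vector can then be used as a single query fingerprint against the database with any of the bit-vector similarity measures.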

30 Locality-Sensitive Hashing
Bin fingerprints based on projections onto randomly directed vectors
log D random vectors → O(log D)
Search for neighbors by returning the bin corresponding to the query’s projection
Has been used for clustering.
May be useful for building diverse data sets.
Not yet developed for searching
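The binning idea can be sketched with sign random projections: each random vector contributes one bit of the bin key, depending on which side of it the fingerprint falls. The slide says only "projections onto randomly directed vectors", so the Gaussian directions, the sign rule, and all function names here are assumptions. This is a toy illustration, not a tuned search index.

```python
import random
from collections import defaultdict


def make_hyperplanes(n_planes, dim, seed=0):
    """Draw n_planes random direction vectors of length dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def bucket_key(vector, planes):
    """One bit per random vector: 1 if the projection is non-negative."""
    return tuple(
        int(sum(p * x for p, x in zip(plane, vector)) >= 0) for plane in planes
    )


def build_index(fingerprints, planes):
    """Group fingerprint indices by their bucket key."""
    index = defaultdict(list)
    for i, fingerprint in enumerate(fingerprints):
        index[bucket_key(fingerprint, planes)].append(i)
    return index


def query_bucket(index, planes, query):
    """Return the indices stored in the bin the query projects into."""
    return index.get(bucket_key(query, planes), [])


fingerprints = [
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 1, 0, 0, 1, 1],
]
planes = make_hyperplanes(n_planes=4, dim=8, seed=1)
index = build_index(fingerprints, planes)
print(query_bucket(index, planes, fingerprints[0]))
```

With about log D random vectors there are on the order of D bins, so a lookup touches one bin instead of the whole database. Near neighbors can still land in adjacent bins, which is why practical schemes typically use several independent tables.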

31 Outline
Historical Progression
Similarity Measures
Fingerprint Construction
Pathological Cases
MinMax - Counts
Pruning Search Space
Aggregate Queries
LSH

