Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search Milorad Tosic, Ph.D. Rutgers, The State University of New Jersey.

Persistent object-oriented hyper-graph model for Maximal Common Substructure (MCS) search Milorad Tosic, Ph.D. Rutgers, The State University of New Jersey Department of Chemistry

Databases of Chemical Structures: Similarity Searching Features
Size of the database
– A couple of hundred thousand structures
Nature of the structure data
– Purified, consistent data
– Raw, inconsistent data
Search type
– Structure search
– Substructure search [DOW96], [BAR93]
– Substructure similarity search [HAG92], [GWW98], [ART92]
– Superstructure search (structures contained in the target structure)
Type of similarity (from less general to more general)
– Graph isomorphism
– Subgraph isomorphism
– Maximal common subgraph

Substructure similarity search
Screening search
– Based on substructural features that are typically small fragment substructures
– Processes many thousands of structures per second
– Precedes the detailed and time-consuming atom-by-atom search
Atom-by-atom search (MCS, Maximal Common Substructure search)
– The MCS of a pair of structures is the largest substructure that is present in both structures.
– The MCS is interpreted as a similarity measure between two structures that corresponds favorably to an “intuitive” notion of chemical similarity.
– The MCS is our primary concern because of its importance for search quality and its exponential computational complexity.
[DOW96], [BAR93], [HAG92], [GWW98], [ART92]
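The screening step described above can be sketched as a cheap comparison of fragment sets that runs before any atom-by-atom work. The slides do not say which coefficient the screen uses; the Tanimoto coefficient below is an assumption, and the fragment strings, the `screen` helper and the threshold are hypothetical.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto coefficient of two fragment sets (assumed screening measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def screen(target: set, database: dict, threshold: float) -> list:
    """Return the names of database structures that pass the fragment screen."""
    return [name for name, fragments in database.items()
            if tanimoto(target, fragments) >= threshold]


# Hypothetical fragment descriptions of a target and three database structures.
target = {"C-C", "C-O", "C=O"}
database = {
    "s1": {"C-C", "C-O"},
    "s2": {"C-N"},
    "s3": {"C-C", "C-O", "C=O", "C-N"},
}

# Only the survivors would be handed to the expensive MCS search.
candidates = screen(target, database, threshold=0.6)
```

Only the structures in `candidates` would go on to the atom-by-atom MCS comparison, which is where the exponential cost lies.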

MCS - Maximal Common Substructure search
NP-complete problem
– Subgraph isomorphism is proven to be an NP-complete problem, which implies that MCS is also NP-complete, since deciding whether one structure occurs inside another is a special case of MCS
– (At least) exponential computational complexity
Average run-time can be reduced by:
– Using a faster computer
– Using various heuristics
– Carrying out some computation in a pre-processing phase
[XUJ96], [BAR93]

Our strategy for MCS search
Back-tracking
– Back-tracking is used as a common underlying algorithm for problems with exponential complexity
Distributed objects
– Distributed computing is explored to increase processing speed
– Persistent objects are essential for the robustness of the search engine
Topology-based comparison criteria
– Topology-based features of chemical structures are attractive for describing structures efficiently
– Topological queries and indexing in collections of distributed objects are considered a promising approach in similar applications
– Our heuristics, which reduce the average search time and postpone the computational explosion to structures that are as large as possible, are based on substructure-by-substructure rather than atom-by-atom search
[XUJ96], [EST98], [WAN98], [PSV99]
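To make the back-tracking idea concrete, here is a minimal sketch of an atom-by-atom search. It is not the author's implementation: it assumes the MCS is measured in atoms, treats the common substructure as an induced subgraph that need not be connected, and ignores bond types. The function name `mcs_size` and the `compatible` hook are hypothetical.

```python
def mcs_size(labels_a, adj_a, labels_b, adj_b, compatible=None):
    """Number of atoms in the largest common induced substructure.

    labels_*: dict atom -> atom type; adj_*: dict atom -> set of bonded atoms.
    compatible(a, b) decides whether atom a may be mapped onto atom b.
    """
    if compatible is None:
        compatible = lambda a, b: labels_a[a] == labels_b[b]
    nodes_a = list(labels_a)
    best = 0

    def extend(i, mapping):
        nonlocal best
        best = max(best, len(mapping))
        # Prune: even mapping every remaining atom cannot beat the best so far.
        if len(mapping) + (len(nodes_a) - i) <= best:
            return
        a = nodes_a[i]
        used = set(mapping.values())
        for b in labels_b:
            if b in used or not compatible(a, b):
                continue
            # Bonds must agree with every pair that is already mapped.
            if all((a2 in adj_a[a]) == (b2 in adj_b[b])
                   for a2, b2 in mapping.items()):
                mapping[a] = b
                extend(i + 1, mapping)
                del mapping[a]  # back-track
        extend(i + 1, mapping)  # alternative branch: leave atom a unmatched

    extend(0, {})
    return best


# C-C-O against C-C-N: the common part is the C-C fragment.
chain_o = ({0: "C", 1: "C", 2: "O"}, {0: {1}, 1: {0, 2}, 2: {1}})
chain_n = ({0: "C", 1: "C", 2: "N"}, {0: {1}, 1: {0, 2}, 2: {1}})
shared = mcs_size(*chain_o, *chain_n)
```

A topology-based criterion would plug in through `compatible`: the stricter the test, the fewer branches the search has to explore.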

Experimental results - question
Compare the search time with and without topology-based criteria, for the same set of target structures and the same set of database structures.
The topology criterion used is based on loop number: an atom X matches an atom Y iff they have the same atom type and the number of loops that X belongs to is not greater than the number of loops that Y belongs to.
To examine how atom types influence the search process, the same set of target structures is applied both with and without hydrogens.
Is there any search speed-up due to the introduction of topology-based comparison criteria?
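The loop-number criterion can be written as a small predicate. The sketch below assumes each atom already carries its loop count from a pre-processing step; the `Atom` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    element: str     # atom type, e.g. "C" or "N"
    loop_count: int  # number of loops the atom belongs to (assumed precomputed)


def matches(x: Atom, y: Atom) -> bool:
    """X matches Y iff the atom types agree and X lies in no more loops than Y."""
    return x.element == y.element and x.loop_count <= y.loop_count


chain_carbon = Atom("C", 0)
ring_carbon = Atom("C", 1)
fused_carbon = Atom("C", 2)
ring_nitrogen = Atom("N", 1)
```

Note that the relation is not symmetric: `matches(chain_carbon, ring_carbon)` holds, while `matches(ring_carbon, chain_carbon)` does not.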

Search with Hydrogens excluded

Search with Hydrogens included

Experimental results - answer
Is there any search speed-up due to the introduction of topology-based comparison criteria? - YES
A search speed-up is evident when topology-based criteria are applied. Oscillations in search time indicate further potential for improving speed. Exponential complexity remains (both curves show the same growth trend), but introducing topology-based criteria shifts the point of run-time explosion to much more complex structures. The relative improvement is higher when structures without hydrogens are considered. If such a conclusion can be drawn for specific atom types, then much better results can be expected for specific substructure types.

Experimental results - question
Do topology-based comparison criteria improve the substructure similarity measure?
Compare the sets of resulting structures obtained by searching with and without topology-based criteria, for the same set of target structures and the same set of database structures.
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria?

Target structure

Two of the resulting structures. [Figure label: The structure is eliminated]

Experimental results - answer
Is there any improvement in the quality of the search results due to the introduction of topology-based comparison criteria? - YES
The number of resulting structures decreases. The probability that the expected structures are found in the set of resulting structures increases.

Serializable hyper-graph
Different characteristic substructures are represented in a uniform way
Efficient implementation of topology-based comparison criteria
Pointer-based data structure with no extra delay due to serialization
Persistent storage of such objects is straightforward
Easy to adapt to any distributed objects technology
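As a rough illustration of how a pointer-based structure can be persisted directly, the sketch below uses Python's standard `pickle` module as a stand-in for whatever serialization mechanism the slides had in mind. The `Node` class and its field names are hypothetical.

```python
import pickle


class Node:
    """Hypothetical hyper-graph node holding direct references to other nodes."""

    def __init__(self, id, type):
        self.id = id
        self.type = type
        self.containers = []  # nodes that contain this node
        self.elements = []    # nodes contained in this node


graph = Node("G1", "GRAPH")
v1 = Node("v1", "VERTEX")
graph.elements.append(v1)
v1.containers.append(graph)  # back-pointer creates a reference cycle

# pickle follows the references and preserves cycles and shared objects.
blob = pickle.dumps(graph)
restored = pickle.loads(blob)
```

After the round trip, `restored.elements[0].containers[0]` is the restored graph object itself, so the pointer structure survives storage without a separate translation layer.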

Hyper-graph: definitions
Definition: A hyper-graph HG is an ordered two-tuple HG = (C, E), where C is the set of hyper-graphs that are containers of HG, and E is the set of hyper-graphs that are elements of HG: C = { c | c > HG }, E = { e | e < HG }.
Definition: An undirected hyper-graph HG is an ordered two-tuple HG = ((C, E), I), where (C, E) is a hyper-graph and I is the set of undirected hyper-graphs that are neighbors of HG. We say that HG is in an undirected connection relation with its neighbors.
Definition: The undirected connection relation is an equivalence relation.

Hyper-graph: definitions (con’t)
Definition: A directed hyper-graph HG is an ordered three-tuple HG = ((C, E), I, O), where (C, E) is a hyper-graph, I is the set of directed hyper-graphs that are input neighbors of HG, and O is the set of directed hyper-graphs that are output neighbors of HG. We say that HG is in a directed connection relation with its neighbors.
Definition: The directed connection relation is an order relation.
Note: We use the undirected hyper-graph in MCS.

Hyper-graph: example
[Figure: graph G1 with vertices v1, …, v8 and edges e12, e23, e24, e35, e45, e46, e57, e67, e68]
v1: id = v1; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12};
v2: id = v2; type = VERTEX; Container = {G1}; Elements = {}; InElements = {e12, e23, e24};
G1: id = G1; type = GRAPH; Container = {}; Elements = {v1, …, v8, e12, e23, …, e68}; InElements = {};
...
e12: id = e12; type = EDGE; Container = {G1}; Elements = {}; InElements = {v1, v2};
e23: id = e23; type = EDGE; Container = {G1}; Elements = {}; InElements = {v2, v3};
...
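The records above can be reproduced with a small object model. This is a sketch under assumptions rather than the author's class design: the `HyperGraph` class, its field names and the `add_element` and `connect` helpers are hypothetical, and edge endpoints are read off the edge names (e12 joins v1 and v2).

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality, since nodes reference each other
class HyperGraph:
    id: str
    type: str
    containers: list = field(default_factory=list, repr=False)
    elements: list = field(default_factory=list, repr=False)
    in_elements: list = field(default_factory=list, repr=False)

    def add_element(self, child: "HyperGraph") -> None:
        """Record containment in both directions (Elements / Container)."""
        self.elements.append(child)
        child.containers.append(self)


def connect(a: HyperGraph, b: HyperGraph) -> None:
    """Undirected connection: each node lists the other in InElements."""
    a.in_elements.append(b)
    b.in_elements.append(a)


def ids(nodes) -> list:
    return [n.id for n in nodes]


graph = HyperGraph("G1", "GRAPH")
vertices = {}
for i in range(1, 9):
    vertex = HyperGraph(f"v{i}", "VERTEX")
    graph.add_element(vertex)
    vertices[vertex.id] = vertex

edges = {}
for name in ["e12", "e23", "e24", "e35", "e45", "e46", "e57", "e67", "e68"]:
    edge = HyperGraph(name, "EDGE")
    graph.add_element(edge)
    edges[name] = edge
    for digit in name[1:]:  # assumption: the two digits name the endpoints
        connect(edge, vertices["v" + digit])
```

With this construction, `ids(vertices["v2"].in_elements)` gives the same InElements set as the v2 record on the slide.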

Hyper-graph: example (con’t)
After simple-loop reduction
[Figure: the reduced graph G2, with nodes g1, g2, g3, g4 joined by edges e1, e2, e3; the grouped substructures shown are {v1, v2, e12}, {v2, v3, v4, v5, e23, e24, e35, e45}, {v4, v5, v6, v7, e45, e46, e57, e67} and {v6, v8, e68}]
G2: id = G2; type = GRAPH; Container = {}; Elements = {g1, g2, g3, g4, e1, e2, e3, e4}; InElements = {};
g1: id = g1; type = GRAPH; Container = {G2}; Elements = {v1, v2, e12}; InElements = {e1};
g2: id = g2; type = LOOP; Container = {G2}; Elements = {v2, v3, v4, v5, e23, e24, e35, e45}; InElements = {e1, e2};
e1: id = e1; type = EDGE; Container = {G2}; Elements = {v2}; InElements = {g1, g2};
e2: id = e2; type = EDGE; Container = {G2}; Elements = {v4, v5, e45}; InElements = {g2, g3};
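The slides do not spell out the reduction algorithm itself. As a simple cross-check on the example, the number of independent loops in G1 can be counted with a union-find pass over the edge list: every edge that joins two already-connected vertices closes one loop. The function name and data layout below are hypothetical.

```python
def independent_loop_count(vertices, edges):
    """Count edges that close a loop (the cyclomatic number of the graph)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Follow parent links to the component root, halving paths on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    loops = 0
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            loops += 1          # both ends already connected: a loop closes
        else:
            parent[root_a] = root_b
    return loops


# Graph G1 from the example: e12, e23, e24, e35, e45, e46, e57, e67, e68.
g1_vertices = [f"v{i}" for i in range(1, 9)]
g1_edges = [("v1", "v2"), ("v2", "v3"), ("v2", "v4"), ("v3", "v5"),
            ("v4", "v5"), ("v4", "v6"), ("v5", "v7"), ("v6", "v7"),
            ("v6", "v8")]
loop_count = independent_loop_count(g1_vertices, g1_edges)
```

For G1 this gives two independent loops (9 edges, 8 vertices, one connected component), consistent with the two four-membered vertex groups in the reduced graph that share edge e45.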

Hyper-graph: class hierarchy

Conclusions
Experimental analysis confirmed again the fact, pointed out in the literature, that topological information about chemical structure (information about loops, in these experiments) can improve substructure similarity searching.
Because MCS is an NP-complete problem, the efficiency of the applied computing model is very important.
Distributed objects are currently the most promising computational approach. Hence, they should be applied to substructure similarity search in chemical structure databases.
The proposed hyper-graph model can efficiently represent both the topology and the behavioral characteristics of a chemical structure in a hierarchical way.
Thanks to an efficient serialization method, the object representation of the hyper-graph can be incorporated into any distributed technology (e.g. CORBA) without decreasing execution efficiency.

References
[DOW96] Downs, G.M., and Willett, P. (1995), Similarity searching in databases of chemical structures, Rev. Comput. Chem., 7, 1-66.
[GWW98] Gillet, V.J., Wild, D.J., Willett, P., and Bradshaw, J. (1998), Similarity and dissimilarity methods for processing chemical structure databases, The Computer Journal, 41, No. 8, 547-558.
[HAG92] Hagadone, T.R. (1992), Molecule substructure similarity searching: Efficient retrieval in two-dimensional structure databases, J. Chem. Inf. Comput. Sci., 32, 515-521.
[WAN98] Wang, T., and Zhou, J. (1998), 3DFS: A new 3D flexible searching system for use in drug design, J. Chem. Inf. Comput. Sci., 38, 71-77.
[XUJ96] Xu, J. (1996), GMA: A generic match algorithm for structural homomorphism, isomorphism, and maximal common substructure match and its applications, J. Chem. Inf. Comput. Sci., 36, 25-34.
[PSV99] Papadimitriou, C.H., Suciu, D., and Vianu, V. (1999), Topological queries in spatial databases, Journal of Comput. and Sys. Sci., 58, 29-53.
[ART92] Artymiuk, J., et al. (1992), Similarity searching of three-dimensional molecules and macromolecules, J. Chem. Inf. Comput. Sci., 32, 617-630.
[BAR93] Barnard, J.M. (1993), Substructure searching methods: Old and new, J. Chem. Inf. Comput. Sci., 33, 532-538.
[EST98] Estrada, E. (1998), Spectral moments of the edge adjacency matrix in molecular graphs, J. Chem. Inf. Comput. Sci., 38, 23-27.
