Graph Indexing: Tree + Δ ≥ Graph (Peixiang Zhao, Jeffrey Xu Yu, Philip S. Yu; University of Illinois at Urbana-Champaign)

Presentation transcript:

Slide 1: Graph Indexing: Tree + Δ ≥ Graph
Peixiang Zhao, Jeffrey Xu Yu, Philip S. Yu
University of Illinois at Urbana-Champaign; IBM T. J. Watson Research Center
September 12th, 2007, VLDB’07, Vienna, Austria

Slide 2: Synopsis
- Introduction
  - Graph Containment Query
  - Algorithmic Framework
- Related Work
- Tree + Δ
  - Indexability of frequent trees
  - Discriminative graph feature selection: Δ
- Experimental Study
- Conclusion

Slide 3: Introduction
- A graph is a mathematical construct and a general data structure representing relations among entities
- The emergence and dominance of graphs call for effective graph data management and mining tools, so that users can organize, access, and analyze graph data efficiently
  - Structural Pattern Mining: Given a graph database, what are the potentially interesting structural patterns and how can we find them?
  - Graph Indexing and Search: How can we index graphs and perform searching, either exactly or approximately, in large graph databases?

Slide 4: Introduction
- Graph Containment Query
  - Given a graph database $G = \{g_1, g_2, \ldots, g_N\}$ and a query graph q, find the set $sup(q) = \{g_i \mid q \subseteq g_i,\ g_i \in G\}$
  - NP-hard, since subgraph-isomorphism checking is NP-Complete
  - Infeasible to check subgraph isomorphism sequentially for every $g_i$ in G; especially challenging when the graphs in G are large, or G is large and diverse
  - Graph indexing!

Slide 5: Graph Indexing: Algorithmic Framework
- Index construction generates the index feature set F from the graph database G. For each feature f, sup(f) is maintained
- Query processing is performed in a filtering-verification fashion:
  - The filtering phase uses the indexed features contained in q to compute the candidate answer set $C_q = \bigcap_{f \subseteq q,\ f \in F} sup(f)$. Every graph in $C_q$ contains all of q's indexed features. Therefore, the query answer set, sup(q), is a subset of $C_q$
  - The verification phase checks subgraph isomorphism for every graph in $C_q$. False positives are pruned and the true answer set sup(q) is returned
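
The filtering-verification framework on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: features are represented by opaque keys, `index` maps each indexed feature to sup(f) as a set of graph ids, and the verification step takes a caller-supplied predicate that stands in for a real subgraph-isomorphism test. All names and the toy data are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, Iterable, Set


def candidate_set(all_ids: Set[int],
                  index: Dict[Hashable, Set[int]],
                  query_features: Iterable[Hashable]) -> Set[int]:
    """Filtering: intersect sup(f) over the indexed features found in q."""
    candidates = set(all_ids)
    for feature in query_features:
        if feature in index:          # features outside F cannot prune
            candidates &= index[feature]
    return candidates


def verify(candidates: Set[int],
           contains_query: Callable[[int], bool]) -> Set[int]:
    """Verification: keep only the candidates that really contain q."""
    return {gid for gid in candidates if contains_query(gid)}


# Toy data (hypothetical): five graphs, two indexed features.
all_ids = {1, 2, 3, 4, 5}
index = {"C-C": {1, 2, 3}, "C-O": {2, 3, 4}}
c_q = candidate_set(all_ids, index, ["C-C", "C-O", "N-N"])
# Stand-in for a subgraph-isomorphism test: pretend only graph 3 matches.
answers = verify(c_q, lambda gid: gid == 3)
print(sorted(c_q), sorted(answers))
```

In this toy run graph 2 is a false positive: it survives filtering but is discarded during verification.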

Slide 6: Query Cost Model
- The cost of processing a graph containment query q upon G, denoted C, can be modeled as $C = C_f + |C_q| \times C_v$
  - $C_f$: the filtering cost
  - $C_v$: the verification cost (NP-Complete)
- Analysis
  1. The key to improving query performance is to minimize $|C_q|$
  2. The indexing feature set F strongly affects both $C_f$ and $|C_q|$
  3. Index construction performance: the feature selection cost $C_{fs}$ to construct F from G
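
A small worked example may make the first analysis point concrete. It assumes the cost takes the form "filtering cost plus one verification per candidate" (C = C_f + |C_q| × C_v); the numbers and the function name are invented purely for illustration.

```python
def query_cost(filter_cost: float, num_candidates: int,
               verify_cost: float) -> float:
    """Assumed cost form: C = C_f + |C_q| * C_v."""
    return filter_cost + num_candidates * verify_cost


# Hypothetical timings in milliseconds: same filtering and per-graph
# verification cost, but two indexes with different pruning quality.
loose = query_cost(filter_cost=5.0, num_candidates=100, verify_cost=20.0)
tight = query_cost(filter_cost=5.0, num_candidates=10, verify_cost=20.0)
print(loose, tight)
```

Because verification dominates, shrinking the candidate set by a factor of ten cuts the total cost by nearly the same factor in this example.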

Slide 7: Related Work
- Path-based indexing approach
  - All existing paths up to a certain length lp are enumerated as indexing features
    - Index can be constructed efficiently
    - Index size is quite large when lp is not small
    - Limited pruning power, mainly because the structural information exhibited in graphs is lost when breaking graphs into paths
  - GraphGrep (PODS’02)
- Graph-based indexing approach
  - Subgraphs of G with different characteristics are selected as indexing features
    - A costly index construction process
    - Compact index structure
    - Great pruning power, since the structural information of graphs is well preserved
  - gIndex (SIGMOD’04, PODS’05), C-Tree (ICDE’06), GString (ICDE’07), GDIndex (ICDE’07), FG-Index (SIGMOD’07)
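
To illustrate what "enumerating all paths up to length lp" means, here is a minimal sketch for a vertex-labeled, undirected graph. It is not GraphGrep's actual procedure: edge labels are ignored, single vertices count as paths of length 0, and a path and its reverse are treated as the same feature. The adjacency-list representation and the function name are assumptions.

```python
from typing import Dict, List, Set, Tuple


def label_paths(adj: Dict[int, List[int]],
                labels: Dict[int, str],
                max_len: int) -> Set[Tuple[str, ...]]:
    """Label sequences of all simple paths with at most max_len edges."""
    found: Set[Tuple[str, ...]] = set()

    def extend(path: List[int]) -> None:
        seq = tuple(labels[v] for v in path)
        found.add(min(seq, seq[::-1]))     # direction-independent form
        if len(path) - 1 == max_len:       # edge budget used up
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:            # keep the path simple
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return found


# Toy molecule-like graph (hypothetical): C - C - O
adj = {0: [1], 1: [0, 2], 2: [1]}
labels = {0: "C", 1: "C", 2: "O"}
print(sorted(label_paths(adj, labels, 2)))
```

Even on this three-vertex graph the feature set grows with the length bound, which hints at why the index becomes large when lp is not small.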

Slide 8: An alternative approach: (Tree + Δ)
- Tree-based graph indexing
  - Tree: better indexability in comparison with path and graph
    - The majority of frequent graph-features of G are in fact tree-features
    - Frequent tree-features and graph-features share similar distributions, and frequent tree-features have pruning power similar to that of graph-features
    - Tree mining can be done much more efficiently than graph mining on G
  - Δ: select a small number of discriminative graph-features on demand, without conducting costly graph mining beforehand
  - Orders of magnitude smaller in index size, yet performs much better than existing approaches in index construction and query processing

Slide 9: Indexability of Path, Tree and Graph
- Frequent features (paths, trees, graphs) expose intrinsic characteristics of a graph database G. They are representative features that discriminate between different groups of graphs in the database
- Which one should we index: path, tree or graph?
  1. The frequent feature set size: $|F|$
  2. The feature selection cost: $C_{fs}$
  3. The candidate answer set size: $|C_q|$

Slide 10: The Frequent Feature Set Size: $|F|$
- Evidence:
  - Among all frequent graph-features of G, the majority are in fact trees
    - All subtrees of a frequent graph are frequent
    - There is little chance that the subtrees of a frequent graph g coincide with those of a frequent graph g’, due to structural diversity and label variety
  - Frequent paths make up a very small portion, because a path-feature has a simple linear structure with little variety in structural complexity
  - In terms of feature distributions, tree-features and graph-features share a very similar distribution w.r.t. feature size

Slide 11: Experiments on Two Datasets w.r.t. $|F|$
[Figure panels: The Real Dataset; The Synthetic Dataset]

Slide 12: The feature selection cost: $C_{fs}$
- Given a graph database G and a minimum support threshold σ, discover the frequent feature set F ($F_P$ / $F_T$ / $F_G$) from G
- Tree: a good compromise between
  - the more expressive, but computationally harder, general graph
  - the faster but less expressive path
- A specialization of the general graph that avoids the undesirable theoretical properties and algorithmic complexity incurred by graphs

                   Path        Tree                      Graph
  Isomorphism      O(n)        O(n)                      P or NPC (?)
  Sub-isomorphism  O(n + m)    O(m^{3/2} n / log m)      NP-Complete

Slide 13: The Candidate Answer Set Size: $|C_q|$
- We define the pruning power power(f) of a frequent feature f as $power(f) = \frac{|G| - |sup(f)|}{|G|}$
- The pruning power of a frequent feature set $S = \{f_1, f_2, \ldots, f_n\}$ is $power(S) = \frac{|G| - |\bigcap_{i=1}^{n} sup(f_i)|}{|G|}$
- Theorem 1: Given a frequent graph-feature g, let its frequent subtree set be $T(g) = \{t_1, t_2, \ldots, t_n\}$. Then power(g) ≥ power(T(g))
- Theorem 2: Given a frequent tree-feature t, let its frequent sub-path set be $P(t) = \{p_1, p_2, \ldots, p_m\}$. Then power(t) ≥ power(P(t))
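
The two theorems can be checked on a toy example. The sketch below assumes that pruning power is the fraction of database graphs a feature (or feature set) rules out, i.e. the graphs that do not contain every feature in the set; that definition, the function name, and the support sets are assumptions chosen for illustration.

```python
from typing import List, Set


def power(db_size: int, supports: List[Set[int]]) -> float:
    """Fraction of graphs pruned by requiring every feature in the set.

    Assumed definition: (|G| - |intersection of sup(f_i)|) / |G|.
    """
    common = set.intersection(*supports)
    return (db_size - len(common)) / db_size


# Hypothetical database of 10 graphs with ids 0..9.
sup_g = {0, 1}                            # graphs containing graph-feature g
subtree_sups = [{0, 1, 2, 3},             # sup(t1), t1 a subtree of g
                {0, 1, 2, 5}]             # sup(t2), t2 a subtree of g
power_g = power(10, [sup_g])
power_t = power(10, subtree_sups)
print(power_g, power_t)
```

Every graph that contains g also contains each subtree of g, so sup(g) is always inside the intersection of the subtree supports. That is why power(g) can never fall below power(T(g)), as in Theorem 1.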

Slide 14: Pruning Power
- The pruning power of all frequent subtree features, T(g), of a frequent graph-feature g can be similar to the pruning power of g
- There is a big gap between the pruning power of a graph-feature g and that of all its frequent sub-path features, P(g)
[Figure: The Real Dataset]

Slide 15: Indexability of Path, Tree and Graph
- It is feasible and effective to select $F_T$, instead of $F_G$, as indexing features for the graph containment query problem
  - The frequent tree-feature set, $F_T$, dominates $F_G$
  - Discovering frequent tree-features from G can be done much more efficiently than mining frequent general graph-features
  - $F_T$ can contribute pruning power similar to that provided by $F_G$

Slide 16: Discriminative Graph Features
- Consider a query graph q which contains a subgraph g
  - If power(T(g)) ≈ power(g), there is no need to index the graph-feature g, because its subtrees jointly have similar pruning power
  - If power(g) >> power(T(g)), it is necessary to select g as an index feature, because g is more discriminative than T(g) in terms of pruning
- Discriminative graph-features (w.r.t. their subtree-features, controlled by $\varepsilon_0$) are selected from queries on demand, without mining the whole set of frequent graph-features from G beforehand
- Discriminative graph-features are used as additional indexing features, denoted Δ, which can be reused to answer subsequent queries
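
One way to picture the on-demand selection of Δ is sketched below. The transcript does not preserve the exact criterion, so the test used here (relative gap between power(g) and power(T(g)) compared against a threshold `eps0`) is an assumption, as are the pruning-power definition, the function names, and the toy support sets.

```python
from typing import Dict, List, Set


def power(db_size: int, supports: List[Set[int]]) -> float:
    """Assumed pruning power: share of graphs lacking some listed feature."""
    return (db_size - len(set.intersection(*supports))) / db_size


def is_discriminative(sup_g: Set[int], subtree_sups: List[Set[int]],
                      db_size: int, eps0: float) -> bool:
    """Assumed test: relative gain of g over its subtrees is at least eps0."""
    p_g = power(db_size, [sup_g])
    p_t = power(db_size, subtree_sups)
    if p_g == 0:
        return False
    return (p_g - p_t) / p_g >= eps0


# Delta: graph-features added on demand and kept for later queries.
delta: Dict[str, Set[int]] = {}


def consider(name: str, sup_g: Set[int], subtree_sups: List[Set[int]],
             db_size: int, eps0: float) -> None:
    if name not in delta and is_discriminative(sup_g, subtree_sups,
                                               db_size, eps0):
        delta[name] = sup_g


# Hypothetical subgraphs of a query over a 10-graph database.
consider("ring", {0, 1}, [{0, 1, 2, 3}, {0, 1, 2, 5}], 10, eps0=0.1)
consider("chain", {0, 1, 2}, [{0, 1, 2}, {0, 1, 2, 3}], 10, eps0=0.1)
print(sorted(delta))
```

In this toy run only "ring" is kept: its subtrees leave one extra candidate that the graph-feature itself would prune, whereas the subtrees of "chain" already prune exactly as well as "chain" does.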

Slide 17: Discriminative Graph Selection
- Given two graphs g, g’ ⊆ q, where g ⊆ g’
  - If the gap between power(g’) and power(g) is large enough, g’ will be reclaimed from G
  - Otherwise, g is discriminative enough for pruning purposes, and there is no need to reclaim g’ in the presence of g
- Approximate the discriminative computation between g’ and g using our knowledge of the frequent tree-features already discovered

Slide 18: Discriminative Graph Selection
- The occurrence probability of g in the graph database G: $Pr(g) = |sup(g)| / |G|$
- The conditional occurrence probability of g’ w.r.t. g, $Pr(g’|g) = |sup(g’)| / |sup(g)|$, models the probability of selecting g’ from G in the presence of g
- The upper and lower bounds of Pr(g’|g): [formulas not preserved in the transcript]
- The conditional occurrence probability Pr(g’|g) is solely upper-bounded by T(g’)
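
The probabilities on this slide can be illustrated with exact fractions. The sketch assumes Pr(g) = |sup(g)| / |G| and, for g ⊆ g’, Pr(g’|g) = |sup(g’)| / |sup(g)|. The upper bound computed here is not necessarily the one on the original slide; it is simply a bound that follows from the fact that every graph containing g’ also contains g and all subtrees of g’. Names and data are hypothetical.

```python
from fractions import Fraction
from typing import List, Set


def occurrence(sup_g: Set[int], db_size: int) -> Fraction:
    """Assumed Pr(g): share of database graphs that contain g."""
    return Fraction(len(sup_g), db_size)


def conditional(sup_g_prime: Set[int], sup_g: Set[int]) -> Fraction:
    """Assumed Pr(g'|g) for g contained in g' (so sup(g') is inside sup(g))."""
    assert sup_g_prime <= sup_g, "requires g to be a subgraph of g'"
    return Fraction(len(sup_g_prime), len(sup_g))


def tree_upper_bound(subtree_sups: List[Set[int]],
                     sup_g: Set[int]) -> Fraction:
    """Bound on Pr(g'|g) that uses only the subtree supports of g'."""
    common = set.intersection(*subtree_sups) & sup_g
    return Fraction(len(common), len(sup_g))


# Hypothetical supports in a 10-graph database.
sup_g = {0, 1, 2, 3, 4}
sup_g_prime = {0, 1}
subtrees_of_g_prime = [{0, 1, 2}, {0, 1, 2, 4}]
pr_g = occurrence(sup_g, 10)
pr_cond = conditional(sup_g_prime, sup_g)
bound = tree_upper_bound(subtrees_of_g_prime, sup_g)
print(pr_g, pr_cond, bound)
```

The point of such a bound is that it can be evaluated from tree-feature supports alone, without first computing sup(g’) by subgraph-isomorphism tests.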

Slide 19: Experimental Studies
- The Real Dataset
  - The AIDS antiviral screen dataset from the Developmental Therapeutics Program (DTP) at NCI/NIH
  - Compounds retrieved from DTP's Drug Information System
  - 63 kinds of atoms in this dataset, most of which are C, H, O, S, etc.
  - Three kinds of bonds are common in these compounds: single bond, double bond and aromatic bond
  - On average, compounds in the dataset have 43 vertices and 45 edges. The largest graph has 221 vertices and 234 edges

Slide 20: Experimental Studies
- The real dataset: index construction

Slide 21: Experimental Studies
- The real dataset: false positive ratio ($|C_q| / |sup(q)|$) w.r.t. the database size (= 1,000; 2,000; 4,000; 8,000; 10,000)
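
The false positive ratio reported here is the candidate set size divided by the true answer set size, so 1.0 means the filter returned no false positives. A minimal sketch of averaging it over a query workload follows; skipping queries with an empty answer set is an assumption of this example, and the counts are invented.

```python
from typing import Iterable, Tuple


def false_positive_ratio(num_candidates: int, num_answers: int) -> float:
    """|C_q| / |sup(q)| for a single query."""
    return num_candidates / num_answers


def average_ratio(pairs: Iterable[Tuple[int, int]]) -> float:
    """Mean ratio over (|C_q|, |sup(q)|) pairs with a non-empty answer set."""
    usable = [(c, a) for c, a in pairs if a > 0]   # assumption: skip a == 0
    return sum(false_positive_ratio(c, a) for c, a in usable) / len(usable)


# Hypothetical workload of three queries.
workload = [(20, 10), (15, 15), (8, 0)]
avg = average_ratio(workload)
print(avg)
```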

Slide 22: Experimental Studies
- The Synthetic Dataset
  - Generated by a widely used graph generator, which is controlled by the following parameters: [parameter table not preserved in the transcript]

Slide 23: Experimental Studies
- The synthetic dataset: false positive ratio

Slide 24: Conclusion
- Graph indexing plays a critical role in graph containment query processing on large graph databases
- Path-based and graph-based indexing approaches suffer from overly large index size, substantial index construction overhead and expensive query processing cost
- (Tree+Δ) is an effective and efficient graph indexing feature for answering graph containment queries
- (Tree+Δ) has a compact index structure, achieves good performance in index construction and, most importantly, provides satisfactory query performance for graph containment queries over large graph databases

Slide 25: Thank you