
1 Graph Feature Mining for Indexing and Classification
A Thesis Proposal in the Department of Computer Science and Engineering
by Dayu Yuan, The Pennsylvania State University

2 Welcome
Committee Chairs: Dr. Prasenjit Mitra, Dr. C. Lee Giles
CSE Department Faculty Members: Dr. Jessy Barlow, Dr. Daniel Kifer
Outside Member: Dr. Zan Huang

3 Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
3. Graph feature mining for subgraph search (in submission)
4. Future work

4 Motivation
Graphs are prevalent: chemical molecules (chemistry), protein structures (biology), computer-aided design, image processing & computer vision, social networks. The goal is to mine and manage graph data. (The figures on this slide come from the internet.)

5 Our focus
Data: a graph database, i.e., a collection of graphs (chemical molecules, small social communities, mechanical parts).
Graph scale: hundreds of nodes and edges.
Graph type: labeled, connected, undirected graphs (the methods can be extended to other types).
[Figure: example graphs g1–g5, giving an intuitive picture of subgraph features.]

6 Graph Feature/Pattern
Graph features are simply subgraphs, subtrees, or random-walk paths. A graph database contains an exponential number of subgraphs, so enumerating all of them is impossible; frequent subgraphs are a popular choice.
[Figure: features P1–P4 and database graphs g1–g5.]

7 Graph Feature/Pattern
Example supporting sets: P1: g1, g2, g3, g4; P2: g1, g4; P3: g2, g3, g4, g5; P4: g1, g3.
[Figure: features P1–P4 over database graphs g1–g5; P1 and P3 are highlighted.]

8 Graph Features in Subgraph Search
In a graph database D = {g1, g2, ..., gn}, given a query graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.
[Figure: query q1 and database graphs g1–g5; g2 and g4 are highlighted.]

9 Graph Features in Subgraph Search
Filtering + verification (gIndex, 2004): if a graph g contains the query q, then g must contain all of q's subgraphs. Indexed features of q can therefore filter the database before verification.
[Figure: query q1, features P1–P4 with supporting sets, and database graphs g1–g5.]

10 Graph Features in Supergraph Search
In a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph (i.e., all database graphs contained in q).
[Figure: query q2 and database graphs g1–g5.]

11 Graph Features in Supergraph Search
Filtering + verification (cIndex, 2007): for any indexed subgraph feature that is not contained in the query, its entire supporting set can be filtered out, since a graph containing such a feature cannot be a subgraph of the query. A short sketch of this rule follows.
[Figure: query q2, features P1–P4 with supporting sets, and database graphs g1–g5.]
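The exclusion rule fits in a few lines. A minimal illustration, not the paper's implementation; feature containment is supplied as a callback, and the assumption that q2 contains only P1 and P3 is made up for the toy run:

```python
def supergraph_candidates(db_ids, postings, in_query):
    """Exclusion filtering for supergraph search: a database graph that
    supports a feature NOT contained in the query q cannot be a subgraph
    of q, so its whole supporting set is pruned."""
    cand = set(db_ids)
    for feature, support in postings.items():
        if not in_query(feature):      # feature is not a subgraph of q
            cand -= support            # prune every graph containing it
    return cand

# Toy run with the slide's supporting sets, assuming (for illustration)
# that q2 contains features P1 and P3 only:
postings = {"P1": {"g1", "g2", "g3", "g4"}, "P2": {"g1", "g4"},
            "P3": {"g2", "g3", "g4", "g5"}, "P4": {"g1", "g3"}}
db = {"g1", "g2", "g3", "g4", "g5"}
print(supergraph_candidates(db, postings, lambda f: f in {"P1", "P3"}))
# -> {'g2', 'g5'} survive for verification
```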

12 Graph Features in Graph Classification
Applications: protein activity prediction, drug toxicity prediction, image classification. Graph kernels are an alternative, but their results and decision rules are hard to interpret; explicit subgraph features keep the model interpretable.
[Figure: features P1–P4 with supporting sets over database graphs g1–g5.]

13 Graph Feature Mining: Motivation & Challenges
Motivation: graph query operations (subgraph/supergraph search) need features to build the index, and graph learning needs features to represent graphs explicitly as vectors.
Challenges: limited memory (indexing); the curse of dimensionality (classification); an exponential number of subgraphs; and too many frequent subgraphs, most of which are redundant and neither discriminative nor informative.

14 Research in Graph Feature Mining
1. Mine frequent subgraphs. Drawbacks: (1) not computationally efficient; (2) the curse of dimensionality; (3) most frequent subgraphs are redundant or not discriminative.

15 Research in Graph Feature Mining
2. Batch-mode discriminative & frequent subgraph mining: (A) first enumerate all frequent subgraphs [the bottleneck], then (B) mine discriminative & frequent subgraphs out of the enumerated set. Challenge: how should the minimum support in step (A) be set? Too small and enumeration explodes; too large and good features are missed.

16 Research in Graph Feature Mining
3. Direct feature mining: (A) search for a single feature f that optimizes an objective function; (B) to find K features, run the algorithm iteratively (forward feature selection).

17 Proposal
Initial study on subgraph search: graph index structure design; graph feature mining.
Future work plan: supergraph search (graph index structure design, initial study); graph classification (graph descriptor mining for classification, irredundant iterative feature mining).
Conclusion: usefulness, challenges, and history of graph feature mining.

18 Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011): background of subgraph search; preliminaries & problem definition; filter + verification [feature-based index approach]
3. Graph feature mining for subgraph search (in submission): direct feature mining for subgraph search
4. Future work

19 Subgraph Search: Definition
Problem definition: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the subgraph search algorithm returns all database graphs containing q as a subgraph.
Solutions: brute force (for each query q, scan the dataset to find the answer set D(q)) versus filter + verification (given q, compute a candidate set C(q) with D(q) ⊆ C(q) ⊆ D, then verify each graph in C(q) to obtain D(q)).

20 Subgraph Search: Solutions
Filter + verification rule: if a graph g contains the query q, then g must contain all of q's subgraphs.
Inverted index of <key, value> pairs: the key is a subgraph feature (a small fragment), and the value is a posting list of the IDs of all database graphs containing that feature.
Example: suppose a query q contains subgraph features p1 and p3, with posting lists {g1, g2} and {g1, g2, g3}. The candidate set is their intersection {g1, g2}, so g3 is filtered out; g1 and g2 must then pass a subgraph isomorphism test to examine whether they are true answers. The pipeline is sketched below.
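A minimal sketch of the whole filter + verification pipeline. This is an illustration, not the paper's implementation; `verify` stands in for a real subgraph isomorphism routine (e.g., VF2):

```python
def subgraph_query(q_features, postings, db_ids, verify):
    """Filter + verification for subgraph search.
    q_features: indexed features contained in the query q
    postings:   feature -> set of IDs of database graphs containing it
    verify:     subgraph isomorphism test on a candidate graph"""
    cand = set(db_ids)
    for f in q_features:
        cand &= postings[f]        # every true answer contains all of q's subgraphs
    return {g for g in cand if verify(g)}   # NP-complete step, survivors only

# The slide's example: q contains p1 and p3.
postings = {"p1": {"g1", "g2"}, "p3": {"g1", "g2", "g3"}}
cand = subgraph_query(("p1", "p3"), postings, {"g1", "g2", "g3"},
                      verify=lambda g: True)   # stand-in for a real VF2 test
print(cand)  # {'g1', 'g2'}: g3 was filtered without any isomorphism test
```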

21 Subgraph Search: Related Work
Total query processing time = (1) filtering cost (reducing D to C(q)): searching for the subgraph features contained in the query, loading the posting files, and intersecting the postings; plus (2) verification cost (reducing C(q) to D(q)): subgraph isomorphism tests, which are NP-complete and dominate the overall cost.
Related work reduces the verification cost by mining subgraph features. Disadvantages: (1) "batch-mode" feature mining; (2) a different index structure is designed for each kind of feature.

22 Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011): background of subgraph search; Lindex, a general index structure for subgraph search that is effective (filtering power), efficient (response time), and compact (memory consumption); experimental results
3. Graph feature mining for subgraph search (in submission): direct feature mining for subgraph search
4. Future work

23 Lindex: A general index structure
A lattice-like index: indexing features are organized in a lattice, and each feature's value set (supporting set / posting list) is partitioned.
Contributions: orthogonal to related work on feature mining; applicable to all subgraph/subtree features (Lindex decouples feature mining from index structure design); compact, effective, and efficient.

24 Lindex: Effective in Filtering
Definition (maxSub, minSuper): let S be the set of all indexing features. maxSub(q) is the set of maximal subgraphs of q in S, and minSuper(q) is the set of minimal supergraphs of q in S.
Example: (1) sg2 and sg4 are maximal subgraphs of q; (2) sg5 is a minimal supergraph of q.

25 Lindex: Effective in Filtering
Strategy one: minimal supergraph filtering. Given a query q and Lindex L(D, S), the candidate set on which subgraph isomorphism must be checked is the intersection of the value sets of q's maximal subgraphs, minus the union of the value sets of q's minimal supergraphs (any database graph containing a supergraph of q necessarily contains q, so those graphs go directly into the answer).
Example: (1) sg2 and sg4 are maximal subgraphs of q; (2) sg5 is a minimal supergraph of q.

26 Lindex: Effective in Filtering
Strategy two: postings partition into direct and indirect value sets. The direct value set of a feature sg contains the database graphs g such that sg can be extended to g without the extension being isomorphic to any other indexed feature; the remaining supporting graphs form the indirect value set.
[Figure: index lattice and database graphs. Question: why is "b" in the direct value set of "sg1" while "a" is not?]

27 Lindex: Effective in Filtering
Given a query q and Lindex L(D, S), the candidate set on which subgraph isomorphism must be checked combines both strategies.
Worked example: suppose the query is graph a. Its maximal subgraphs are sg1 and sg2, and its minimal supergraph is sg4. Beside each lattice node, the first list shown is the direct value set and the second is the indirect value set. The algorithm intersects the direct value sets of the maximal subgraphs, obtaining {a, c} ∩ {a} = {a} as the candidate set. The union of the value sets of the minimal supergraph sg4 is {c}, which is included directly in the answer and deducted from the candidate set. Finally, the element a of the remaining candidate set {a} is verified to contain the query and is added to the answer. The output answer set is {a, c}.
[Figure: for query "a", the number of graphs to verify under the traditional model, strategy (1), and strategies (1 + 2).]

28 Lindex: Compact (Space Saving via Extension Labeling)
Each edge of a graph is represented as the tuple <ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v)>. For example, the label of graph sg2 is <1,2,6,1,7>, <1,3,6,2,6>, and the label of its chosen parent sg1 is <1,2,6,1,7>. Since the parent's label is a prefix, sg2 can be stored as just the extension <1,3,6,2,6>.
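A sketch of this prefix-delta storage (illustrative; it assumes the chosen parent's edge sequence is a prefix of the child's, as in the slide's example):

```python
# Edge tuple: (ID(u), ID(v), Label(u), Label(edge(u, v)), Label(v))
sg1 = [(1, 2, 6, 1, 7)]                      # chosen parent
sg2 = [(1, 2, 6, 1, 7), (1, 3, 6, 2, 6)]     # child = parent + one extension edge

def extension_of(child, parent):
    """Store a child feature as only the edges beyond its parent's label."""
    assert child[:len(parent)] == parent, "parent label must be a prefix"
    return child[len(parent):]

print(extension_of(sg2, sg1))   # [(1, 3, 6, 2, 6)] is all that must be stored for sg2
```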

29 Lindex: Empirical Evaluation of Memory (unit: KB)
Index \ Feature | DFG       | ∆TCFG     | MimR | Tree+∆  | DFT
Feature count   | 7599/6238 | 9873/5712 | 5000 | 6172/38 | 7500/6172
Gindex          | 1359      | 1534      | 1348 | 1339    | -
FGindex         | -         | 1826      | -    | -       | -
SwiftIndex      | -         | -         | -    | -       | 860
Lindex          | 677       | 841       | 772  | 676     | 671

30 Lindex: Efficient in maxSub Feature Search
Instead of constructing a canonical label for each subgraph of q and comparing it with the labels in the index, Lindex traverses the lattice and reuses work: the mappings constructed to check that a feature sg1 is contained in q are incrementally extended to check whether a supergraph of sg1 in the lattice, say sg2, is also contained in q.
Example: the label of sg2 is <1,2,6,1,7>, <1,3,6,2,6>, and the label of its chosen parent sg1 is <1,2,6,1,7>; node 1 of sg1 is mapped to node 1 of sg2, so only the extension edge <1,3,6,2,6> must be matched, as sketched below.
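A simplified sketch of the incremental mapping extension (it assumes, for brevity, that sg2 extends sg1 by one new node v attached to an existing node u; the general case would also handle edges between already-mapped nodes):

```python
def extend_embeddings(maps, q_adj, q_label, u, v, v_label, e_label):
    """Extend each embedding of sg1 into q by sg2's one extra edge (u, v).
    maps:    list of dicts mapping sg1 node ids -> q node ids
    q_adj:   q adjacency, q_adj[x][y] = edge label
    q_label: q node labels"""
    out = []
    for m in maps:
        used = set(m.values())
        for qv, lbl in q_adj[m[u]].items():    # try every neighbor of u's image
            if qv not in used and lbl == e_label and q_label[qv] == v_label:
                m2 = dict(m)
                m2[v] = qv                     # grow the mapping by one node
                out.append(m2)
    return out
```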

31 Lindex: Efficient in minSuper Feature Search
The set of minimal supergraphs of a query q in Lindex is a subset of the intersection of the sets of descendants of each subgraph node of q in the partial lattice, so the search can be restricted to that intersection.

32 Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs). [Figure: results.]

33 Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs). [Figure: results.]

34 Lindex: Experiments
Experiments on the AIDS dataset (40,000 graphs). [Figure: results.]

35 Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
3. Graph feature mining for subgraph search (in submission): direct feature mining for subgraph search; problem definition & objective function; branch & bound; heuristic-based search-space exploration (partition of the search space); experimental results
4. Future work

36 Feature Mining: Motivation
All previous feature selection algorithms for the subgraph search problem work in batch mode: they assume a stable database, suffer from the frequent-subgraph-enumeration bottleneck, and are hard to tune (minimum support, etc.).
Our contributions: the first direct feature mining algorithm for the subgraph search problem; effective index updating; high-quality feature selection.

37 Feature Mining: Problem Definition
Iterative index updating: given a database D and the current index I with feature set P0, (1) remove the least useful feature, (2) add a new feature, (3) go to (1) until convergence.
[Figure: features p1–p4 with supporting sets over database graphs g1–g5, and query q1.]

38 Feature Mining: Problem Definition
Previous work: given a graph database D, find a set of subgraph (subtree) features minimizing the response time over a set of training queries Q.
Our work: given a graph database D and an already-built index I with feature set P0, search for a single new feature p such that the feature set P0 ∪ {p} minimizes the response time.

39 Feature Mining: Problem Definition
Iterative index updating: given a database D and the current index I with feature set P0, (1) remove the least useful feature (find the least useful p in P0), (2) add a new feature (find a new feature p with maximal gain), (3) go to (1).

40 Feature Mining: More on the Objective Function
(1) Pros and cons of using query logs: the objective functions of previous algorithms (e.g., Gindex, FGindex) also depend on queries, though only implicitly.
(2) Selected features are "discriminative": in previous work, the discriminative power of a subgraph sg is measured w.r.t. sub(sg) or sup(sg), where sub(sg) denotes all subgraphs of sg and sup(sg) all supergraphs of sg; in our objective function, discriminative power is measured w.r.t. the current feature set P0.
(3) Computational issues: see the next slides.

41 Feature Mining: More on the Objective Function (cont.)
[Figure: the minimal-supergraph queries minSup(p, Q) within the query set Q.] Key issue: computing the supporting set D(p) for each enumerated feature p is expensive.

42 Feature Mining: Estimating the Objective Function
The objective function of a new subgraph feature p has an upper bound Upp(p, P0) and a lower bound Low(p, P0) that are inexpensive to compute (proof omitted).
Lazy calculation: the exact gain(p, P0) need not be computed when Upp(p, P0) < gain(p*, P0) (p cannot beat the current best feature p*) or when Low(p, P0) > gain(p*, P0) (p surely beats p*). A sketch of this logic follows.
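A sketch of the lazy evaluation (the names Upp/Low/gain follow the slide; their implementations are assumed callables passed in):

```python
def lazy_best(candidates, upp, low, gain, p_star, gain_star):
    """Consult cheap bounds before paying for the exact gain computation."""
    for p in candidates:
        if upp(p) < gain_star:
            continue                       # p cannot beat p*: skip the exact gain
        if low(p) > gain_star:
            p_star, gain_star = p, low(p)  # p surely beats p*; its lower bound
            continue                       # is a safe conservative score
        g = gain(p)                        # bounds inconclusive: compute exactly
        if g > gain_star:
            p_star, gain_star = p, g
    return p_star, gain_star
```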

43 Feature Mining: Challenges
(1) The search space for the new index subgraph feature p is exponential. (2) The objective function is neither monotonic nor anti-monotonic, so the Apriori rule cannot be used. (3) Traditional heuristic-based graph feature mining algorithms (e.g., LeapSearch) do not work here, because they rely only on frequencies.

44 Feature Mining: Branch and Bound
Exhaustive search over the DFS-code tree: a graph pattern can be canonically labeled as a string, and the DFS-code tree is a prefix tree over these labels. For each branch (e.g., the branch rooted at node n5), find a branch upper bound that is at least the gain value of every node on that branch; since the objective function is neither monotonic nor anti-monotonic, such bounds are what make pruning possible.
[Figure: DFS-code prefix tree with nodes n1–n7.]

45 Feature Mining: Branch and Bound
Theorem: for a feature p, an upper bound BUpp(p, P0) exists such that for every p' that is a supergraph of p, gain(p', P0) ≤ BUpp(p, P0) (proof omitted). Any branch whose bound does not exceed the best gain found so far can therefore be pruned wholesale, as sketched below.
[Figure: DFS-code prefix tree with nodes n1–n7.]
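A sketch of the resulting pruned search (illustrative; `children` enumerates DFS-code extensions and `bupp` is the theorem's BUpp, both assumed callables):

```python
def branch_and_bound(node, children, gain, bupp, best):
    """DFS over the DFS-code prefix tree; prune any branch whose bound
    cannot beat the incumbent. best = (best_pattern, best_gain)."""
    best_p, best_g = best
    g = gain(node)
    if g > best_g:
        best_p, best_g = node, g
    for child in children(node):
        if bupp(child) <= best_g:        # bounds the gain of ALL supergraphs below
            continue                     # safe to prune the entire branch
        best_p, best_g = branch_and_bound(child, children, gain, bupp,
                                          (best_p, best_g))
    return best_p, best_g
```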

46 Feature Mining: Heuristic-Based Search-Space Partition
Problem: the branch-and-bound search always starts from the same root and explores in the same order.
Observation: the new pattern p must be a supergraph of some pattern already in P0 (e.g., p ⊃ p2 in Figure 4). A root r is promising when: (1) a large proportion of the queries are supergraphs of r, since otherwise few queries can use a feature p ⊃ r for filtering; and (2) the average candidate-set size for queries ⊃ r is large, meaning improvement on those queries matters most.

47 Feature Mining: Heuristic-Based Search-Space Partition
Procedure:
(1) gain(p*) = 0
(2) Sort all features in P0 by the sPoint(pi) score in decreasing order
(3) For i = 1 to |P0|: if the branch upper bound BUpp(ri) < gain(p*), break; else find the minimal-supergraph queries minSup(ri, Q), run p*(ri) = BranchAndBound(minSup(ri, Q), p*), and if gain(p*(ri)) > gain(p*), update p* = p*(ri).
Discussion: (1) candidate features are enumerated as descendants of each root, partitioning the search space; (2) candidate features only need to be frequent on D(r), not on all of D, permitting a smaller minimum support; (3) roots are visited in decreasing sPoint(r) order, so a near-optimal feature is found quickly; (4) the scheme extends to top-k feature selection. A runnable sketch of the outer loop follows.
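A sketch of the outer loop (function names are illustrative stand-ins for the slide's sPoint, BUpp, minSup, and the branch & bound subroutine):

```python
def partitioned_mining(P0, spoint, bupp, minsup_queries, bb_search):
    """Visit index features as search roots in decreasing sPoint order.
    spoint(r):           heuristic promise score of root r
    bupp(r):             branch upper bound BUpp(r)
    minsup_queries(r):   queries whose minimal supergraph feature is r
    bb_search(r, Q, g):  branch & bound below root r, seeded with incumbent g"""
    p_star, gain_star = None, 0.0
    for r in sorted(P0, key=spoint, reverse=True):   # most promising roots first
        if bupp(r) < gain_star:
            break                          # per the procedure: stop at this point
        q_r = minsup_queries(r)
        p_r, g_r = bb_search(r, q_r, gain_star)      # explore supergraphs of r only
        if g_r > gain_star:
            p_star, gain_star = p_r, g_r
    return p_star, gain_star
```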

48 Feature Mining: Experiment
Setup: the AIDS dataset D (40K chemical molecules). Index0: Gindex with minimum support 0.05. IndexDF: Gindex with minimum support 0.02 (1,175 new features added). Indexes QG/BB/TK are updated from Index0, where BB = branch and bound, QG = search-space partitioning, and TK = top-k features returned in one iteration. All methods are compared at the same decrease in candidate set size.

49 [Figure: experimental results.]

50 Feature Mining: Experiment
[Figure: results.]

51 Feature Mining: Experiment
Two datasets D1 and D2 (80% overlap). DF(D1): Gindex built on dataset D1; DF(D2): Gindex built on dataset D2. Indexes QG/BB/TK are updated from DF(D1) (BB = branch and bound, QG = search-space partitioning, TK = top-k features returned in one iteration). Exp1: D2 = D1 + 20% new graphs. Exp2: D2 = 80% of D1 + 20% new graphs. Updating iterates until the objective value is stable.

52 Feature Mining: Experiments
DF vs. iterative methods. [Figure: results.]

53 Feature Mining: Experiments
[Figure: results.]

54 Feature Mining: Experiments
TCFG vs. iterative methods; MimR vs. iterative methods. [Figures: results.]

55 Outline
1. Motivation & Introduction
2. Graph index design for subgraph search (WebDB 2011)
3. Graph feature mining for subgraph search (in submission)
4. Future work: supergraph search (index structure, feature selection for supergraph search); graph feature mining for classification; timeline; conclusion
Conclusions so far, on the subgraph search problem:
1. Lindex: an index structure general enough to support any features; compact, effective, and efficient.
2. Direct feature mining: a third-generation algorithm (no frequent-feature-enumeration bottleneck); effective at updating the index to accommodate changes; runs much faster than building the index from scratch; and the selected features filter more false positives than features selected from scratch.

56 Future work: supergraph search
Problem definition: in a graph database D = {g1, g2, ..., gn}, given a query graph q, the supergraph search algorithm returns all database graphs that have q as a supergraph.
Exclusive logic filtering: for any subgraph feature not contained in the query, its supporting set can be filtered out.
In-memory model: if the postings are instead stored on disk, this filtering incurs too many disk operations.

57 Future work: supergraph search
On-disk model & feature selection: features are organized in a lattice (as in Lindex), and each feature is associated with one disjoint value set.
[Figure: lattice with Sg1, Sg2, Sg3 and value sets Sg1:[a], Sg2:[c], Sg3:[b]; query q.]
Although the query processing model over this poIndex has no explicit filtering step, it still prunes some false graphs (e.g., graph g in Figure 5.1). However, some false positives that poIndex cannot filter could be pruned by explicit logic filtering: a graph g′ may contain a feature f′ not contained in the query q, yet be assigned to the value set of a feature f1 that is contained in q; such a g′ is pruned by explicit logic filtering but not by the poIndex query processing model. To minimize such cases and improve the filtering power of poIndex, we need both a feature selection algorithm and a value-set assignment algorithm, discussed in detail in the next subsection.

58 Future work: supergraph search
On-disk model & feature selection, with an alternative value-set assignment.
[Figure: the same lattice with value sets Sg1:[], Sg2:[a,c], Sg3:[b], illustrating how the assignment affects filtering power.]

59 Future work: supergraph search
On-disk model & feature selection. [Figure.]

60 Future work: supergraph search
On-disk model & feature selection, tentative solution: cast it as the banking float maximization problem (similar to the max-cover problem), which is NP-hard but admits a polynomial-time approximation with ratio (1 - 1/e).
Scalability issues: (a) solve the problem with MapReduce (MapReduce solutions exist for the max-cover problem [63, 65]); (b) direct feature mining.

61 Future work: graph classification
Feature vectors for graph data: given a set of n graph features p1, p2, ..., pn, a graph g can be encoded as a vector X_g = [x1, x2, ..., xn]^T, where each xi is a {0, 1}-valued variable and xi = 1 if and only if pi ⊆ g. Compared with graph kernels, the advantage is interpretability. A sketch of the encoding follows.
[Figure: graph vectors based on features p1–p4.]
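One possible realization of the encoding using networkx (an illustration, not the proposal's code; note that VF2's subgraph_is_isomorphic tests node-induced subgraphs, which may be stricter than the containment semantics intended here):

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def contains(g, pattern):
    """True iff pattern matches a (node-induced) subgraph of g with equal labels."""
    gm = GraphMatcher(g, pattern,
                      node_match=lambda a, b: a.get("label") == b.get("label"))
    return gm.subgraph_is_isomorphic()

def graph_vector(g, features):
    """X_g = [x1, ..., xn] with xi = 1 iff feature pi is contained in g."""
    return [1 if contains(g, p) else 0 for p in features]

# Tiny usage: a labeled path a-b-c and the single-edge feature a-b.
g = nx.Graph(); g.add_edges_from([(0, 1), (1, 2)])
nx.set_node_attributes(g, {0: "a", 1: "b", 2: "c"}, "label")
p = nx.Graph(); p.add_edge(0, 1)
nx.set_node_attributes(p, {0: "a", 1: "b"}, "label")
print(graph_vector(g, [p]))  # [1]
```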

62 Future work: graph classification
Previous work: (1) class-two algorithms [batch mode]; (2) class-three algorithms [direct mining]: iterative mining selects only one feature per iteration and suffers from redundancy, since a feature f with a very low objective value in iteration i may be re-enumerated in later iterations; (3) heuristic-based feature mining: the subgraph search space is explored randomly, and near-optimal features tend to be discovered quickly (much faster than exact search).
Open question: why not use other descriptors instead of subgraphs as features?

63 Future work: graph classification
Direction one: descriptor mining. Tentative solution: use information diffusion models to build a descriptor capturing local context and topology information.

64 Future work: graph classification
Direction one: descriptor mining. What is information diffusion? (the word-of-mouth effect). General assumptions shared by the models: nodes are either active or inactive; active nodes may cause others to activate; active nodes never deactivate. Two standard models: the Linear Threshold Model [69] and the Independent Cascade Model [70].
Building a descriptor: (1) for each node v, activate only v and trigger the information diffusion procedure; (2) collect all active nodes to build the descriptor.

65 Information Diffusion Models
Linear Threshold Model [69]: a node v is influenced by each neighbor w with weight b(v, w); v becomes active when the total weight of its active neighbors reaches its threshold, i.e., Σ_{active w} b(v, w) ≥ θ_v, where θ_v is a random variable in [0, 1].
Independent Cascade Model [70]: when node v becomes active, it has one and only one chance of activating each inactive neighbor w; the activation attempt succeeds with probability p_vw.
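A minimal Monte Carlo sketch of the Independent Cascade model (the data structures are illustrative, not from the slides):

```python
import random

def independent_cascade(adj, p, seeds, rng=random.random):
    """One simulation of the IC model.
    adj:   node -> iterable of neighbors
    p:     directed pair (u, v) -> activation probability p_uv
    seeds: initially active nodes"""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        new = []
        for u in frontier:
            for v in adj[u]:
                # each edge (u, v) is tried exactly once, when u first activates
                if v not in active and rng() < p[(u, v)]:
                    active.add(v)
                    new.append(v)
        frontier = new
    return active
```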

66 Future work: graph classification
Building a descriptor: (1) for each node v, activate only v and trigger the information diffusion; (2) collect all active nodes to build the descriptor; (3) let pi denote the probability that a node with a given label becomes active, yielding Des(v, l) for node v and label l.
Example descriptors:
DesA: label a p1 = 1, label b p2 = .8, label c p3 = .9, label d p4 = .4, ...
DesB: label a p1 = 0.5, label b p2 = .4, label c p3 = .3, label d p4 = .7, ...
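One plausible reading of Des(v, l), estimated by repeated simulation (this reuses independent_cascade from the previous sketch; the exact definition of the slide's probabilities is an assumption):

```python
from collections import Counter

def descriptor(adj, p, labels, v, trials=1000):
    """Estimate, for each label l, the probability that some node labeled l
    becomes active when only v is initially activated (Monte Carlo)."""
    hits = Counter()
    for _ in range(trials):
        active = independent_cascade(adj, p, {v})
        hits.update({labels[u] for u in active})   # count each label once per run
    return {l: hits[l] / trials for l in set(labels.values())}
```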

67 Future work: graph classification
Mining descriptors for graph classification converts to a probabilistic itemset mining problem. Future plan: (1) try different information diffusion models and compare their pros and cons for modeling local context and topology information; (2) explore discriminative probabilistic itemset mining algorithms for (binary) classification.

68 Future work: graph classification
Direction two: irredundant iterative feature mining via an embedding-fading approach. For example, in Figure 5.2, after detecting that f is not significant w.r.t. the objective function, we lower the weights of the nodes and edges of f's embeddings in graphs g1 and g2. If the dashed edge weights drop from 1 to 0.5, then in the next iteration the (weighted) frequency of f decreases from 2 to 1, while the frequency of f1 is barely affected because f1's embeddings do not overlap much with f's. A sketch of this bookkeeping follows.
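A sketch of the fading bookkeeping (the min-edge-weight aggregation rule is an assumption, chosen because it reproduces the slide's 2 -> 1 example: two embeddings faded from weight 1 to 0.5 yield a weighted frequency of 1):

```python
def fade(edge_weight, embeddings, factor=0.5):
    """Lower the weight of every edge used by an embedding of a feature
    that was judged insignificant."""
    for emb in embeddings:                 # emb: set of edge keys in a db graph
        for e in emb:
            edge_weight[e] *= factor
    return edge_weight

def weighted_freq(embeddings, edge_weight):
    """Assumed aggregation: an embedding counts by its weakest edge."""
    return sum(min(edge_weight[e] for e in emb) for emb in embeddings)
```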

69 Future work: timeline
[Figure: timeline.]

70 Thanks. Questions?

71 References
For a complete list of references, please refer to ...

72 Backup Slides
[Figure: features f1 and f2, database graphs g1 and g2, and query graph q.]

