Slide 1: gSpan: Graph-based substructure pattern mining
Authors: Xifeng Yan and Jiawei Han
Presented by: Colin Luther
Slide 2: Copyright note
This presentation was originally provided by Prof. Xifeng Yan upon request from a student.
Citation: Xifeng Yan and Jiawei Han. gSpan: graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM), 2002.
Slide 4: Background
- Frequent subgraph mining is an extension of existing frequent pattern mining algorithms.
- A major challenge is counting how many instances of a pattern occur in the dataset.
- Counting instances is easy for sets but subtle for graphs.
- Recall the graph isomorphism problem.
Slide 5: Background
[Figure: two isomorphic graphs (a) and (b) with their mapping function (c)]
G1 = (V1, E1, L1)
G2 = (V2, E2, L2)
Mapping function (c):
f(V1.1) = V2.2
f(V1.2) = V2.5
f(V1.3) = V2.3
f(V1.4) = V2.4
f(V1.5) = V2.1
Two graphs are isomorphic if one can find a one-to-one mapping from the nodes of the first graph onto the nodes of the second graph such that edges and the labels on nodes and edges are preserved.
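The definition above can be checked mechanically once a candidate mapping is given. The following is a minimal Python sketch, not part of the original slides. It assumes a hypothetical representation (a graph is a pair of dicts: vertex labels, and edge labels keyed by a `frozenset` of the two endpoints) and a small made-up pair of graphs, because the edges of the graphs in the figure are not recoverable from this transcript.

```python
def is_isomorphism(g1, g2, f):
    """Return True if f maps g1 onto g2, preserving edges and all labels.

    Assumed representation: a graph is (vertex_labels, edge_labels), where
    vertex_labels maps vertex -> label and edge_labels maps
    frozenset({u, v}) -> label (undirected simple graph).
    """
    v1, e1 = g1
    v2, e2 = g2
    # f must be a bijection between the two vertex sets.
    images = set(f.values())
    if set(f) != set(v1) or images != set(v2) or len(images) != len(f):
        return False
    # Vertex labels must be preserved.
    if any(v1[u] != v2[f[u]] for u in v1):
        return False
    # Mapping every edge of g1 must yield exactly the labeled edges of g2.
    mapped = {frozenset(f[u] for u in edge): label for edge, label in e1.items()}
    return mapped == e2


# Hypothetical example graphs (not the ones from the figure).
g1 = ({1: "X", 2: "Y", 3: "Z"},
      {frozenset({1, 2}): "a", frozenset({2, 3}): "b"})
g2 = ({1: "Z", 2: "X", 3: "Y"},
      {frozenset({2, 3}): "a", frozenset({3, 1}): "b"})

good = {1: 2, 2: 3, 3: 1}
bad = {1: 1, 2: 2, 3: 3}
print(is_isomorphism(g1, g2, good))  # True
print(is_isomorphism(g1, g2, bad))   # False
```

Verifying a given mapping is cheap. The hard part, which gSpan addresses with DFS codes, is that the mapping is not known in advance.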
Slide 6: Problem: Finding Frequent Subgraphs
Problem setting: similar to finding frequent itemsets for association rule discovery.
Input:
- A database of graph transactions
  - Each transaction is an undirected simple graph (no multiple edges).
  - Each graph transaction has labeled edges and vertices.
  - Transactions may not be connected.
- A minimum support threshold
Output: the frequent subgraphs that satisfy the support threshold, where each frequent subgraph is connected.
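To make the input and output concrete, here is a minimal Python sketch of the simplest case: finding the frequent 1-edge patterns. It is an illustration only, not the gSpan procedure. It assumes that support is counted once per transaction, and it uses a hypothetical representation in which each transaction is a pair of dicts (vertex labels, and edge labels keyed by a `frozenset` of endpoints). The three-transaction database is made up.

```python
from collections import Counter


def frequent_edges(db, min_support):
    """Return the 1-edge patterns contained in at least min_support transactions.

    A pattern is (smaller vertex label, edge label, larger vertex label), so
    the two orientations of an undirected edge give the same pattern.
    """
    counts = Counter()
    for vertex_labels, edge_labels in db:
        patterns = set()  # each pattern counts at most once per transaction
        for edge, elabel in edge_labels.items():
            a, b = sorted(vertex_labels[v] for v in edge)
            patterns.add((a, elabel, b))
        counts.update(patterns)
    return {p: n for p, n in counts.items() if n >= min_support}


# Hypothetical database of three graph transactions.
db = [
    ({0: "X", 1: "Y", 2: "Z"}, {frozenset({0, 1}): "a", frozenset({1, 2}): "b"}),
    ({0: "Y", 1: "X"}, {frozenset({0, 1}): "a"}),
    ({0: "X", 1: "Y", 2: "X"}, {frozenset({0, 1}): "a", frozenset({1, 2}): "a"}),
]
print(frequent_edges(db, min_support=2))  # {('X', 'a', 'Y'): 3}
```

The third transaction contains the X-a-Y edge twice, but it contributes only one to the support, because support here counts transactions rather than embeddings.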
Slide 8: Authors' Contribution
- Represents graphs as strings (like TreeMiner).
- No candidate generation!
- “It combines the growing and checking of frequent subgraphs into one procedure, thus accelerates the mining process.”
- Really fast; still a standard baseline that most rival systems are compared against.
Slide 9: Concepts behind gSpan
- The idea is to produce Depth-First Search (DFS) codes for the edges of each graph.
- Edges are sorted according to the lexicographic order of their codes.
- Yan and Han proved that graph isomorphism can be tested for two graphs annotated with DFS codes.
- Starting with small graph patterns containing one edge, patterns are expanded systematically by the DFS search.
- The anti-monotonic property of graph frequency is used to prune the search.
Slide 10: Anti-Monotonicity of graph frequency
The frequency of a super-pattern is less than or equal to the frequency of any of its sub-patterns.
Copyright SIGMOD’08
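Stated as a formula, the property and the pruning rule it justifies look as follows. The symbols are assumptions introduced here for illustration: g is a pattern, g' is any super-pattern of g, support(·) is the number of transactions containing a pattern, and minsup is the minimum support threshold.

```latex
\begin{align*}
g \subseteq g' &\;\Longrightarrow\; \operatorname{support}(g') \le \operatorname{support}(g), \\
\operatorname{support}(g) < \mathit{minsup} &\;\Longrightarrow\; \operatorname{support}(g') < \mathit{minsup}
  \quad \text{for every } g' \supseteq g .
\end{align*}
```

The second line follows directly from the first: once a pattern is infrequent, none of its extensions needs to be examined.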
Slide 11: Lexicographic Ordering in Graph
- It can tell us the order of two graphs.
- The design can help us build a similar hierarchy.
- The design should guarantee easy growing from one level to the next lower level and easy rolling up from a lower level to a higher level.
- For graphs, it may be difficult to find a design in which no two nodes of the tree are the same.
- It can tell us whether a graph has already been discovered.
- Most importantly, if a graph has been discovered, all its child nodes in the hierarchy must have been discovered as well.
Slide 12: Lexicographic Ordering in Graph
[Figure: search hierarchy with levels of 1-edge, 2-edge, 3-edge, ... graphs]
Slide 13: DFS code and Minimum DFS code
Depth First Tree and Forward/Backward Edge Set
Slide 14: DFS code and Minimum DFS code
We use a 5-tuple (vi, vj, l(vi), l(vj), l(vi,vj)) to represent an edge. (It may be redundant, but it is much easier to understand.)
Turn a graph into a sequence whose basic element is the 5-tuple. Form the sequence in this order:
1. To extend to one new node, add the forward edge that connects a node in the old graph with the new node.
2. Add all backward edges that connect the new node to other nodes in the old graph.
3. Repeat this procedure.
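The construction above can be sketched in Python as a depth-first traversal that numbers vertices in discovery order. This is an illustrative sketch rather than the authors' implementation. The function name `dfs_code`, the input format (a dict of vertex labels plus a list of `(u, v, label)` edges, tried in list order) and the triangle example are assumptions, and the graph is assumed to be connected.

```python
def dfs_code(vertex_labels, edge_list, start):
    """Build one DFS code (a list of 5-tuples) for a connected, undirected graph."""
    adj = {v: [] for v in vertex_labels}
    edge_label = {}
    for u, v, label in edge_list:
        adj[u].append(v)
        adj[v].append(u)
        edge_label[frozenset((u, v))] = label

    index = {start: 0}  # vertex -> discovery number
    used = set()        # edges already written to the code
    code = []

    def emit(u, w):
        edge = frozenset((u, w))
        used.add(edge)
        code.append((index[u], index[w], vertex_labels[u], vertex_labels[w],
                     edge_label[edge]))

    def visit(u):
        for w in adj[u]:
            if frozenset((u, w)) in used:
                continue
            # Forward edge: connects the old graph to the new vertex w.
            index[w] = len(index)
            emit(u, w)
            # Backward edges: from w to already discovered vertices,
            # ordered by the discovery number of the target.
            back = [x for x in adj[w]
                    if x in index and frozenset((w, x)) not in used]
            for x in sorted(back, key=index.get):
                emit(w, x)
            visit(w)

    visit(start)
    return code


# Hypothetical triangle: two vertices labeled x, one labeled y.
labels = {"p": "x", "q": "y", "r": "x"}
edges = [("p", "q", "a"), ("q", "r", "b"), ("r", "p", "a")]
for entry in dfs_code(labels, edges, start="p"):
    print(entry)
# (0, 1, 'x', 'y', 'a')
# (1, 2, 'y', 'x', 'b')
# (2, 0, 'x', 'x', 'a')
```

Starting from a different vertex, or trying the edges in a different order, generally produces a different code for the same graph.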
Slide 15: DFS code
[Figure: example graph with vertices v0 to v4 and its DFS code]
e0: (0,1,x,y,a)
e1: (1,2,y,x,b)
e2: (2,0,x,x,a)
e3: (2,3,x,z,c)
e4: (3,1,z,y,b)
e5: (1,4,y,z,d)
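Reading each tuple as (i, j, l(vi), l(vj), l(vi,vj)), an edge is a forward edge when i < j and a backward edge when i > j. The short Python sketch below splits the example code accordingly and recovers the vertex labels. The tuples are typed in by hand, with the labels in e4 and e5 taken from the vertex labels implied by e0 through e3 (l(v1) = y, l(v3) = z). The helper names are hypothetical.

```python
code = [
    (0, 1, "x", "y", "a"),  # e0
    (1, 2, "y", "x", "b"),  # e1
    (2, 0, "x", "x", "a"),  # e2
    (2, 3, "x", "z", "c"),  # e3
    (3, 1, "z", "y", "b"),  # e4
    (1, 4, "y", "z", "d"),  # e5
]


def split_edges(code):
    """Separate forward edges (i < j) from backward edges (i > j)."""
    forward = [(i, j) for i, j, *_ in code if i < j]
    backward = [(i, j) for i, j, *_ in code if i > j]
    return forward, backward


def vertex_labels(code):
    """Collect one label per vertex; fail if two tuples disagree."""
    labels = {}
    for i, j, li, lj, _ in code:
        for v, label in ((i, li), (j, lj)):
            if labels.setdefault(v, label) != label:
                raise ValueError(f"vertex {v} has labels {labels[v]!r} and {label!r}")
    return labels


forward, backward = split_edges(code)
print(forward)              # [(0, 1), (1, 2), (2, 3), (1, 4)]
print(backward)             # [(2, 0), (3, 1)]
print(vertex_labels(code))  # {0: 'x', 1: 'y', 2: 'x', 3: 'z', 4: 'z'}
```

The forward edges form the depth-first tree, and the backward edges close the cycles.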
Slide 16: Minimum DFS code
Each graph may have many DFS codes (why?). The lexicographically smallest one is its minimum DFS code.
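One way to see why a graph has many DFS codes is to enumerate them: every choice of start vertex, and every choice of which undiscovered neighbor to visit next, gives a different depth-first tree. The brute-force Python sketch below does this for a small connected graph. It is an illustration only, and it makes one important simplifying assumption: the "minimum" is taken with Python's built-in tuple comparison, which is NOT the DFS lexicographic order defined in the gSpan paper. Any fixed total order still yields a canonical code that is identical for isomorphic graphs, but the representative chosen may differ from gSpan's minimum DFS code. All names and example graphs are hypothetical.

```python
def all_dfs_codes(vertex_labels, edge_list):
    """Enumerate every DFS code of a small connected, undirected graph."""
    adj = {v: [] for v in vertex_labels}
    edge_label = {}
    for u, v, label in edge_list:
        adj[u].append(v)
        adj[v].append(u)
        edge_label[frozenset((u, v))] = label

    codes = set()

    def grow(path, index, code):
        if len(index) == len(vertex_labels):
            codes.add(tuple(code))
            return
        # Backtrack to the deepest vertex that still has an undiscovered neighbor.
        while path and all(w in index for w in adj[path[-1]]):
            path = path[:-1]
        if not path:
            return  # graph is not connected; ignored in this sketch
        u = path[-1]
        for w in adj[u]:
            if w in index:
                continue
            new_index = {**index, w: len(index)}
            # Forward edge to w, then backward edges from w to earlier vertices.
            step = [(index[u], new_index[w], vertex_labels[u], vertex_labels[w],
                     edge_label[frozenset((u, w))])]
            back = sorted((x for x in adj[w] if x in index and x != u),
                          key=index.get)
            step += [(new_index[w], index[x], vertex_labels[w], vertex_labels[x],
                      edge_label[frozenset((w, x))]) for x in back]
            grow(path + [w], new_index, code + step)

    for start in vertex_labels:
        grow([start], {start: 0}, [])
    return codes


def min_code(vertex_labels, edge_list):
    # Assumption: plain tuple order, not the paper's DFS lexicographic order.
    return min(all_dfs_codes(vertex_labels, edge_list))


tri = ({"p": "x", "q": "y", "r": "x"},
       [("p", "q", "a"), ("q", "r", "b"), ("r", "p", "a")])
same = ({"u": "y", "v": "x", "w": "x"},   # tri with renamed vertices
        [("u", "v", "b"), ("v", "w", "a"), ("w", "u", "a")])
other = ({"u": "y", "v": "x", "w": "x"},  # one edge label changed
         [("u", "v", "b"), ("v", "w", "b"), ("w", "u", "a")])

print(len(all_dfs_codes(*tri)))             # 6
print(min_code(*tri) == min_code(*same))    # True
print(min_code(*tri) == min_code(*other))   # False
```

Even this three-vertex triangle has six distinct DFS codes, which is why a single canonical representative is needed. The enumeration is exponential in general, so it is useful only for understanding the definition on tiny graphs.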
Slide 17: Graph Parent and its Children
Given a DFS code c0 = (e0, e1, …, en), if c1 = (e0, e1, …, en, ex) and c0 < c1, then c0 is c1’s parent and c1 is c0’s child.
[Figure: graph with vertex labels X, Y, Z and edge labels a, b, c]
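In code, the parent and child relation is a prefix test: the child is the parent code with exactly one more edge appended. The Python sketch below illustrates this with codes stored as tuples of 5-tuples. The function name `is_parent` and the example codes are hypothetical. Python's tuple comparison, used here as a stand-in for the DFS lexicographic order, also ranks a proper prefix before any of its extensions.

```python
def is_parent(c0, c1):
    """Return True if c1 extends c0 by exactly one edge."""
    return len(c1) == len(c0) + 1 and c1[:len(c0)] == c0


# Hypothetical codes: a 2-edge pattern and two candidate 3-edge patterns.
c0 = ((0, 1, "x", "y", "a"), (1, 2, "y", "x", "b"))
c1 = c0 + ((2, 0, "x", "x", "a"),)                     # c0 plus one backward edge
c2 = ((0, 1, "x", "y", "a"), (1, 2, "y", "z", "b"),    # differs in the second edge
      (2, 0, "z", "x", "a"))

print(is_parent(c0, c1))  # True
print(is_parent(c0, c2))  # False
print(c0 < c1)            # True
```

Growing patterns one edge at a time in this way is what arranges all DFS codes into a tree, with each node's children obtained by appending one edge.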
Slide 19: Theorem
1. Given two graphs G0 and G1, G0 is isomorphic to G1 iff min_dfs_code(G0) = min_dfs_code(G1).
2. The DFS Code Tree covers all graphs, although some tree nodes may represent the same graph.
3. Given a node in the DFS Code Tree, if its DFS code is not its minimum DFS code, pruning this node and all its descendants will not change the “covering”.
Slide 24: Conclusion
- No candidate generation and no false-positive test.
- Space saving from depth-first search.
- Good performance: using a “memory pool” and one major counting improvement, it seems performance could improve by a further factor of 5 (but this needs more testing).
Slide 25: Exam Questions
Q1) What two major costs of Apriori-like frequent substructure mining algorithms did gSpan aim to reduce or avoid?
Answer:
1) Creating size-(k+1) candidate subgraphs from size-k frequent subgraphs is more complicated and costly than standard Apriori large-itemset generation.
2) Pruning false positives is an expensive process: the subgraph isomorphism problem is NP-complete.
Slide 26: Exam Questions (cont.)
Q2) Which DFS tree does the DFS code below belong to?
[Figure: DFS code and candidate DFS trees]
Answer: tree (c)
Slide 27: Exam Questions
Q3) What does gSpan compare when testing for isomorphism between two graphs, and why?
Answer: gSpan compares the minimum DFS codes of the two graphs. Given two graphs G and G’, G is isomorphic to G’ if and only if min(G) = min(G’). This theorem reduces the comparison of complicated graphs to a simple string comparison. If two nodes of the DFS code tree contain the same graph but different DFS codes, we can prune the sub-branch of the rightmost of the two nodes. This greatly decreases the problem size.