1 Mining for Tree-Query Associations in a Graph. Jan Van den Bussche, Hasselt University, Belgium. Joint work with Bart Goethals (U Antwerp, Belgium) and Eveline Hoekx (U Hasselt, Belgium).

2 Graph Data. A (directed) graph over a set of nodes N is a set G of edges: ordered pairs $(i, j)$ with $i, j \in N$. Figure: snapshot of a graph representing the metabolic pathway of a human. Applications: life sciences, biology, social sciences, WWW, ...

3 Graph Mining. Transactional category: – dataset: set of many small graphs (transactions) – frequency: number of transactions in which the pattern occurs (at least once) – ILP: Warmr [AGM, FSG, TreeMiner, gSpan, FFSM, Horvath-Ramon-Wrobel]. Single graph category: – dataset: single large graph – frequency: number of copies of the pattern in the large graph [Subdue, Vanetik-Gudes-Shimony, SEuS, SiGraM, Jeh-Widom]. The focus is on pattern mining; there is little work on association rule mining!

4 Tree-Query Pattern. A powerful tree-shaped pattern inspired by conjunctive database queries. Special features: – existential nodes – parameterized nodes. An occurrence of the pattern in G is any homomorphism from the pattern into G. Frequency: $\#\{x \mid \exists z: (0, z) \in G \wedge (z, 8) \in G \wedge (z, x) \in G\}$

5 Association rules. Fully fledged associations over tree-query patterns. Example:

6 Experimental results: Real-life datasets. Food web: frequency = 176, confidence = 89%.

8 Experimental results: Food web. Figure labels: 45%, 55%.

9 Experimental results: Real-life datasets. Protein interaction graph: confidence = 10%.

10 Experimental results: Protein interaction graph. Figure label: 90%.

11 Outline of the rest of the talk. Formal problem definition. Algorithm: – overall approach – levelwise generation of tree patterns – generation of containment mappings – generation of parameter assignments. Equivalent association rules. Certhia. Performance and experimental results. Future work.

12 Tree pattern

16 Tree pattern, expressed in SQL: select distinct G3.to as x from G G1, G G2, G G3 where G1.from=5 and G1.to=G2.from and G1.to=G3.from and G2.to=8

17 Matching. A matching is a homomorphism h from the pattern into G, written here as the tuple of images of the pattern nodes (0, z, 8, x). The slides build up the matchings one by one: $h_1$ = (0, 1, 8, 4), $h_2$ = (0, 1, 8, 8), $h_3$ = (0, 2, 8, 4), $h_4$ = (0, 2, 8, 5), $h_5$ = (0, 2, 8, 8).

24 Frequency. The five matchings yield three distinct values of x (4, 5 and 8), so frequency = 3.
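The matching and frequency slides can be replayed with a short sketch. The edge set `G` below is hypothetical: it is not taken from the slides and was chosen only so that the pattern with constants 0 and 8 (edges 0 → z, z → 8, z → x) has exactly the five matchings listed. The function names are illustrative as well.

```python
from itertools import product

# Hypothetical graph, chosen so that it yields the five matchings on the slides.
G = {(0, 1), (0, 2), (1, 8), (2, 8), (1, 4), (2, 4), (2, 5)}


def matchings(graph):
    """All homomorphisms of the pattern 0 -> z, z -> 8, z -> x into graph.

    Each matching is returned as the tuple (0, z, 8, x).
    """
    nodes = {n for edge in graph for n in edge}
    return sorted(
        (0, z, 8, x)
        for z, x in product(nodes, repeat=2)
        if (0, z) in graph and (z, 8) in graph and (z, x) in graph
    )


def frequency(graph):
    """Number of distinct values taken by the distinguished variable x."""
    return len({m[3] for m in matchings(graph)})


print(matchings(G))
print(frequency(G))
```

The existential node z is projected away before counting, which is why five matchings collapse to a frequency of three.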

25 Tree Query. Q = (H, P), with body P (the tree pattern) and head H.

26 Association Rule. AR: $Q_1 \Rightarrow Q_2$, where $Q_2 \subseteq Q_1$. Confidence(AR) = freq($Q_2$)/freq($Q_1$). Example: $\{(x_1, x_2, x_3) \mid Q_1(x_1, x_2, x_3) \in G\} \supseteq \{(x, x, 6) \mid Q_2(x, x, 6) \in G\}$
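A minimal sketch of the confidence computation follows. The graph and both queries are hypothetical, chosen only to mimic the answer shapes (x1, x2, x3) and (x, x, 6) from the slide: here Q1 returns triples of children of a common parent, and Q2 keeps the Q1 answers whose first two components coincide and whose third is the constant 6.

```python
from fractions import Fraction
from itertools import product

# Hypothetical graph; not taken from the slides.
G = {(0, 1), (0, 6), (2, 6), (2, 3)}


def answers_q1(graph):
    """Hypothetical Q1: triples (x1, x2, x3) of children of a common parent."""
    children = {}
    for src, dst in graph:
        children.setdefault(src, set()).add(dst)
    return {t for kids in children.values() for t in product(kids, repeat=3)}


def answers_q2(graph):
    """Hypothetical Q2: the Q1 answers of the form (x, x, 6)."""
    return {t for t in answers_q1(graph) if t[0] == t[1] and t[2] == 6}


def confidence(graph):
    """Confidence of Q1 => Q2, i.e. freq(Q2) / freq(Q1)."""
    return Fraction(len(answers_q2(graph)), len(answers_q1(graph)))


print(confidence(G))
```

Because every answer of Q2 is also an answer of Q1, the ratio always lies between 0 and 1.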

27 Examples of Association Rules: (1), (2)

29 Containment Mapping. $Q_2 \subseteq Q_1 \iff$ there exists a containment mapping from $Q_1$ to $Q_2$.

34 Problem statement: Mining tree queries. Given a graph G and a threshold k, find all tree queries that have frequency at least k in G; those queries are called frequent.

35 Problem statement: Association rules. Input: – a graph G – minsup – $Q_{left}$, frequent in G – minconf. Output: all association rules $Q_{left} \Rightarrow Q$ that are – frequent in G – confident in G.

36 Algorithm: mining tree queries. Outer loop: Generate, incrementally, all possible trees of increasing sizes. Avoid generation of isomorphic trees. Inner loop: For each newly generated tree, generate all queries based on that tree, and test their frequency.

37 Outer loop. It is well known how to efficiently generate all trees uniquely up to isomorphism, based on a canonical form of trees. [Scoins, Li-Ruskey, Zaki, Chi-Yang-Muntz]
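To illustrate what a canonical form of trees buys, here is a minimal sketch of one standard encoding for rooted unordered trees (sorted parenthesis strings). It is not necessarily the canonical form used in the cited works or in the authors' implementation; the dictionary representation of a tree is an assumption made for the example.

```python
def canonical_form(children, root):
    """Canonical string of a rooted unordered tree.

    children maps a node to the list of its child nodes. Child encodings are
    sorted, so isomorphic trees receive the same string regardless of the
    order in which children are listed.
    """
    encoded = sorted(canonical_form(children, c) for c in children.get(root, []))
    return "(" + "".join(encoded) + ")"


# Two isomorphic trees that differ only in which child carries the grandchild.
tree_a = {0: [1, 2], 1: [3]}
tree_b = {0: [1, 2], 2: [3]}
# A chain of four nodes, which is not isomorphic to the trees above.
chain = {0: [1], 1: [2], 2: [3]}

print(canonical_form(tree_a, 0))
print(canonical_form(tree_b, 0))
print(canonical_form(chain, 0))
```

A generator that emits only trees whose encoding is in canonical order never produces two isomorphic trees, which is the property the outer loop relies on.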

38 Inner loop: Levelwise approach. A query Q is characterized by: – $\Pi_Q$, the set of existential nodes – $\Sigma_Q$, the set of selected nodes – the labeling $\lambda_Q$ of the selected nodes by constants. $Q_2 = (\Pi_2, \Sigma_2, \lambda_2)$ specializes $Q_1 = (\Pi_1, \Sigma_1, \lambda_1)$ if $\Pi_1 \subseteq \Pi_2$, $\Sigma_1 \subseteq \Sigma_2$ and $\lambda_2$ agrees with $\lambda_1$ on $\Sigma_1$. If $Q_2$ specializes $Q_1$ then $freq(Q_2) \le freq(Q_1)$. Most general query: $T = (\emptyset, \emptyset, \emptyset)$
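The specialization test is simple enough to sketch directly. The sketch assumes a query over a fixed tree is stored as a triple (existential nodes, selected nodes, labeling of the selected nodes by constants); this representation and the node names are illustrative, not taken from the authors' code.

```python
def specializes(q2, q1):
    """True if q2 specializes q1.

    A query is a triple (existential, selected, labeling), where existential
    and selected are frozensets of nodes and labeling maps each selected node
    to a constant.
    """
    existential1, selected1, labeling1 = q1
    existential2, selected2, labeling2 = q2
    return (
        existential1 <= existential2
        and selected1 <= selected2
        and all(labeling2[n] == labeling1[n] for n in selected1)
    )


most_general = (frozenset(), frozenset(), {})
q1 = (frozenset({"x2"}), frozenset({"x1"}), {"x1": 0})
q2 = (frozenset({"x2"}), frozenset({"x1", "x3"}), {"x1": 0, "x3": 8})
q3 = (frozenset({"x2"}), frozenset({"x1", "x3"}), {"x1": 5, "x3": 8})

print(specializes(q2, q1), specializes(q3, q1))
```

Since frequency can only drop under specialization, a levelwise search can discard every specialization of an infrequent query without counting it.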

39 Inner loop: Candidate generation. $CanTab_{\Pi,\Sigma} = \{\lambda \mid (\Pi, \Sigma, \lambda)$ is a candidate query$\}$. $FreqTab_{\Pi,\Sigma} = \{\lambda \mid (\Pi, \Sigma, \lambda)$ is a frequent query$\}$. $Q' = (\Pi', \Sigma', \lambda')$ is a parent of $Q = (\Pi, \Sigma, \lambda)$ if either: $\Sigma' = \Sigma$ and $\Pi$ has precisely one more node than $\Pi'$, or $\Pi' = \Pi$ and $\Sigma$ has precisely one more node than $\Sigma'$. Join Lemma: Each candidacy table can be computed by taking the natural join of its parent frequency tables.

40 Inner loop: Frequency counting. Each candidacy table can be computed by a single SQL query (ref. Join Lemma). Suppose G(from, to) is a table in the database; then each frequency table can also be computed with a single SQL query. – $\Sigma = \emptyset$: formulate the query in SQL and count. – $\Sigma \neq \emptyset$: formulate $(\Pi, \emptyset, \emptyset)$ in SQL, giving E; take the natural join of E with $CanTab_{\Pi,\Sigma}$; group by $\Sigma$; count each group.

41 Inner loop: Example. Join expression: $CanTab_{\{x_2\},\{x_1,x_3\}} = FreqTab_{\emptyset,\{x_1,x_3\}} \bowtie FreqTab_{\{x_2\},\{x_1\}} \bowtie FreqTab_{\{x_2\},\{x_3\}}$
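The Join Lemma can be sketched as follows: a candidacy table is the natural join of its parent frequency tables. The three parent tables below are hypothetical contents for an example with existential node x2 and selected nodes x1 and x3. Rows are plain dictionaries from selected node to constant, which is an assumption of this sketch rather than the authors' storage format.

```python
from functools import reduce


def natural_join(left, right):
    """Natural join of two tables given as lists of dict rows."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            if all(l_row[k] == r_row[k] for k in shared):
                joined.append({**l_row, **r_row})
    return joined


def candidate_table(parent_freq_tables):
    """Candidacy table as the natural join of all parent frequency tables."""
    return reduce(natural_join, parent_freq_tables)


# Hypothetical parent frequency tables.
freq_x1_x3 = [{"x1": 0, "x3": 8}, {"x1": 0, "x3": 4}, {"x1": 5, "x3": 8}]
freq_x1 = [{"x1": 0}]
freq_x3 = [{"x3": 8}, {"x3": 4}]

print(candidate_table([freq_x1_x3, freq_x1, freq_x3]))
```

The row with x1 = 5 is dropped because it has no partner in the table for x1 alone: a labeling stays a candidate only if every parent labeling is frequent.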

47 Inner loop: Example. SQL expression E for the example pattern: select distinct G1.from as x1, G2.to as x3, G3.to as x4 from G G1, G G2, G G3 where G1.to = G2.from and G3.from = G2.from

48 Inner loop: Example. SQL expression for filling the frequency table: select distinct E.x1, E.x3, count(E.x4) from E, CanTab {x2}{x1,x3} as CT where E.x1 = CT.x1 and E.x3 = CT.x3 group by E.x1, E.x3 having count(E.x4) >= k
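The two SQL expressions can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under stated assumptions, not the authors' DB2 setup: the edge columns are renamed `src` and `dst` because `from` and `to` are SQL keywords, the candidacy table is simply called `CanTab`, E is defined as a view, and the graph and candidate labelings are hypothetical.

```python
import sqlite3


def frequent_labelings(edges, candidates, k):
    """Return (x1, x3, freq) for candidate labelings with frequency >= k."""
    con = sqlite3.connect(":memory:")
    con.execute("create table G (src integer, dst integer)")
    con.executemany("insert into G values (?, ?)", edges)
    con.execute("create table CanTab (x1 integer, x3 integer)")
    con.executemany("insert into CanTab values (?, ?)", candidates)
    # Expression E: the pattern x1 -> x2, x2 -> x3, x2 -> x4 with x2 projected away.
    con.execute("""
        create view E as
        select distinct G1.src as x1, G2.dst as x3, G3.dst as x4
        from G G1, G G2, G G3
        where G1.dst = G2.src and G3.src = G2.src
    """)
    # Join E with the candidates, group by the parameters, keep frequent groups.
    rows = con.execute("""
        select E.x1, E.x3, count(E.x4) as freq
        from E, CanTab as CT
        where E.x1 = CT.x1 and E.x3 = CT.x3
        group by E.x1, E.x3
        having count(E.x4) >= ?
        order by E.x1, E.x3
    """, (k,)).fetchall()
    con.close()
    return rows


# Hypothetical graph and candidate labelings of (x1, x3).
edges = [(0, 1), (0, 2), (1, 8), (2, 8), (1, 4), (2, 4), (2, 5)]
candidates = [(0, 8), (0, 5), (1, 8)]

print(frequent_labelings(edges, candidates, 3))
```

Because E is already duplicate-free on (x1, x3, x4), counting x4 within each group gives the number of distinct answers for that labeling.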

49 Algorithm: Mining association rules. Loop 1: Generate incrementally all possible trees T of increasing sizes. Loop 2: For each T, generate all frequent tree patterns P based on T. Loop 3: For each P, generate all containment mappings f from $P_{left}$ to P. Loop 4: For each f, generate $Q = (f(H_{left}), P)$ and all parameter instantiations for $Q_{left} \Rightarrow Q$.

50 Pattern database. For each P, a table $FreqTab_P$ that contains all frequent parameter instantiations. Together these tables form the pattern database.

51 Loop 3: Generation of containment mappings. Efficiently solvable, due to the tree shape.

52 Loop 4: Generation of parameter instantiations. Done with a single relational algebra expression (SQL).

55 Example: Loop 4. select freqQleft.x1, freqQleft.x4, freqP.x1, freqP.x4, freqP.x5, freqP.freq, freqP.freq/freqQleft.freq from freqP, freqQleft where freqQleft.x1=freqP.x1 and freqQleft.x4=freqP.x4 and freqP.freq/freqQleft.freq >= minconf
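The Loop 4 query can be exercised with `sqlite3` as a rough sketch. The table contents are hypothetical, and the column types of the original tables are not given on the slide. If `freq` is stored as an integer, `freqP.freq/freqQleft.freq` would be an integer division in many SQL dialects, so this sketch multiplies by 1.0 to force a fractional confidence; that adjustment is an assumption, not part of the slide.

```python
import sqlite3


def confident_rules(freq_qleft, freq_p, minconf):
    """Return (x1, x4, x5, freq, confidence) for rules meeting minconf."""
    con = sqlite3.connect(":memory:")
    con.execute("create table freqQleft (x1, x4, freq)")
    con.execute("create table freqP (x1, x4, x5, freq)")
    con.executemany("insert into freqQleft values (?, ?, ?)", freq_qleft)
    con.executemany("insert into freqP values (?, ?, ?, ?)", freq_p)
    # Match each instantiation of P with the instantiation of Q_left that
    # shares its parameters, and keep the pairs that are confident enough.
    rows = con.execute("""
        select freqP.x1, freqP.x4, freqP.x5, freqP.freq,
               1.0 * freqP.freq / freqQleft.freq as conf
        from freqP, freqQleft
        where freqQleft.x1 = freqP.x1 and freqQleft.x4 = freqP.x4
          and 1.0 * freqP.freq / freqQleft.freq >= ?
        order by freqP.x1, freqP.x4, freqP.x5
    """, (minconf,)).fetchall()
    con.close()
    return rows


# Hypothetical frequency tables: (x1, x4, freq) and (x1, x4, x5, freq).
freq_qleft = [(0, 8, 10), (1, 8, 4)]
freq_p = [(0, 8, 3, 9), (0, 8, 5, 2), (1, 8, 3, 4)]

print(confident_rules(freq_qleft, freq_p, 0.8))
```

Only the instantiations whose frequency ratio reaches `minconf` survive, so one query yields all confident rules for a given containment mapping.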

56 Equivalent queries. Queries $Q_1$ and $Q_2$ are equivalent if they have the same result sets on all graphs G (up to renaming of the distinguished variables). Two cases of equivalent queries: 1. $Q_1$ has fewer nodes than $Q_2$. 2. $Q_1$ and $Q_2$ have the same number of nodes.

57 Equivalence theorem. A containment mapping from $Q_1$ to $Q_2$ is a homomorphism $h: Q_1 \to Q_2$ that maps distinguished variables of $Q_1$ one-to-one to distinguished variables of $Q_2$, and maps selected nodes of $Q_1$ to selected nodes of $Q_2$, preserving labels. Two queries are equivalent if and only if there are containment mappings between them in both directions.

58 Case 1: $Q_1$ has fewer nodes than $Q_2$. Redundancy lemma: Let Q be a tree query without selected nodes. Then Q has a redundancy if and only if it contains a subtree C in the form of a linear chain of existential nodes (possibly just a single node), such that the parent of C has another subtree that is at least as deep as C. Figure: redundant subtree.

59 Case 2: $Q_1$ and $Q_2$ have the same number of nodes. $Q_1$ and $Q_2$ must be isomorphic. Canonical form of queries: refine the canonical ordering of the underlying unlabeled tree, taking into account node labels.

60 Equivalent Association Rules: (1), (2)

61 Equivalence detection for rules. Many cases can be checked efficiently, but the worst case is still as hard as general graph isomorphism checking. Fast heuristics for graph isomorphism checking exist, e.g. Nauty.

62 Certhia. Loop 1 + Loop 2: preprocessing step, producing the pattern database. Loop 3 + Loop 4: interactive browsing tool Certhia, shown in the demo session.

63 Experimental results: Performance. Fully implemented on top of IBM DB2. Preliminary performance results: – adequate performance – huge number of patterns – constant overhead per discovered pattern

64 Performance: Association rules. Loop 3 and Loop 4: – very fast – constant overhead per rule

65 Future work. Serious scientific data mining. Loosen the restriction to trees.

66 Publications. B. Goethals, E. Hoekx, J. Van den Bussche, “Mining tree queries in a graph”, KDD’05, p. 61–69. E. Hoekx, J. Van den Bussche, “Mining tree-query associations in a graph”, ICDM’06 regular paper. “Certhia: Tree-query mining in large graphs”, ICDM’06 software demo.

