Efficiently Mining Frequent Trees in a Forest. Mohammed J. Zaki.


1 Efficiently Mining Frequent Trees in a Forest. Mohammed J. Zaki

2 Frequent Structure Mining (FSM). FSM deals with extracting patterns (associations, sequences, frequent trees, graphs, etc.) from massive databases. Typical applications: bioinformatics, web mining, and mining semi-structured documents.

3 Tree Mining Problem. Goal: efficiently enumerate all frequent subtrees in a forest (a database D of trees) according to a given minimum support (minsup). The support of a subtree S is the number of trees in D that contain at least one occurrence of S. A subtree S is frequent if its support is greater than or equal to a user-specified minsup value.

4 Rooted, Ordered and Labeled Trees. A tree is an acyclic connected graph. Rooted: there is one vertex, the root, that is distinguished from the others. Ordered: the children of each node in a rooted tree are ordered. Labeled: each node is associated with a label. Every tree in the paper is a rooted, ordered, labeled tree.

5 Definition of Subtrees. We denote a tree as T = {N, B}, where N is a set of labeled nodes and B is a set of branches. A tree S = {Ns, Bs} is an embedded subtree of T = {N, B} if:
1. Ns is a subset of N;
2. a branch (x, y) appears in Bs iff x is an ancestor of y in T, i.e., the two vertices lie on the same path from the root to a leaf in T.
Hence, embedded subtrees allow not only direct parent-child branches but also ancestor-descendant branches. A disconnected pattern is a sub-forest of T.

6 Examples of subtrees. [Figure: a tree T, an embedded subtree S of T, and a disconnected pattern that is not a subtree of T but a sub-forest.]

7 Node Numbers and Labels. Each node has a well-defined number i, given by its position in a depth-first traversal of the tree. The label of each node is taken from a set of labels L = {0, 1, …, m-1} and represents the value of the node. [Figure: example tree annotated with node numbers 0 to 7 and node labels.]

8 Scope of a Node. The scope of a node n_i is [i, r]: the lower bound is the node's own position i, and the upper bound is the position r of its right-most leaf. Assume two nodes x and y have scopes Sx = [i_x, r_x] and Sy = [i_y, r_y].
Sx is strictly less than Sy (Sx < Sy) iff r_x < i_y, i.e., Sx occurs before Sy. In that case y is an embedded sibling of x.
Sx contains Sy iff i_x ≤ i_y and r_x ≥ r_y. In that case y is a descendant of x.
Scopes in the example tree, by node number: [0,7], [1,4], [2,3], [3,3], [4,4], [5,7], [6,7], [7,7].
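The two scope comparisons can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, scopes are modeled as (lower, upper) pairs, and containment is assumed to mean lower_x ≤ lower_y and upper_x ≥ upper_y. The sample scopes are taken from the example tree on this slide.

```python
from typing import Tuple

# A scope is (lower, upper): the node's own depth-first position and the
# position of its right-most leaf. Representation chosen for this sketch.
Scope = Tuple[int, int]

def strictly_less(sx: Scope, sy: Scope) -> bool:
    """True when sx ends before sy begins, i.e. x occurs before y."""
    return sx[1] < sy[0]

def contains(sx: Scope, sy: Scope) -> bool:
    """True when sy lies inside sx (assumed definition, see lead-in)."""
    return sx[0] <= sy[0] and sx[1] >= sy[1]

# Scopes from the slide's example tree.
node1, node2, node5 = (1, 4), (2, 3), (5, 7)
print(contains(node1, node2))       # node 2 is a descendant of node 1
print(strictly_less(node1, node5))  # node 5 comes after the subtree of node 1
```

Because a node's lower bound is its own position, two distinct nodes never share a lower bound, so `contains` on distinct nodes always describes a proper ancestor-descendant pair.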

9 Representing Trees as Strings. To create the string encoding of a tree, denoted t, we perform a depth-first search starting (and ending) at the root, adding each visited node's label x to t. Whenever we backtrack from a child to its parent, we add the special symbol –1 to the string. Example string encoding: 0 2 1 1 –1 –1 1 –1 –1 4 3 –1 2 –1 –1
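As a rough illustration of the encoding, the sketch below encodes a tree given as nested (label, children) pairs and recovers node scopes from an encoding. The nested-tuple representation and the function names `encode` and `scopes` are assumptions made for this example, not part of the paper. The sample tree is T0 from the database example later in the deck.

```python
def encode(tree):
    """Depth-first string encoding; -1 marks a backtrack from child to parent.

    tree is (label, [child, child, ...]); this layout is an assumption.
    """
    label, children = tree
    out = [label]
    for child in children:
        out.extend(encode(child))
        out.append(-1)  # back from this child to the current node
    return out

def scopes(encoding):
    """Return [(label, (lower, upper)), ...] in depth-first node order."""
    labels, upper, stack = [], [], []
    for sym in encoding:
        if sym == -1:
            # Closing a node: its right-most leaf is the last node seen so far.
            upper[stack.pop()] = len(labels) - 1
        else:
            stack.append(len(labels))
            labels.append(sym)
            upper.append(None)
    for pos in stack:  # nodes still open at the end, e.g. the root
        upper[pos] = len(labels) - 1
    return [(lab, (pos, upper[pos])) for pos, lab in enumerate(labels)]

t0 = (1, [(2, []), (3, [(4, [])])])
print(encode(t0))          # [1, 2, -1, 3, 4, -1, -1]
print(scopes(encode(t0)))  # [(1, (0, 3)), (2, (1, 1)), (3, (2, 3)), (4, (3, 3))]
```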

10 Equivalence Classes. Two k-subtrees X and Y are in the same prefix equivalence class iff they share a common prefix up to the (k-1)th node.
• Prefix string: 2 1 0 –1 3
• The following three subtrees are in the same prefix equivalence class:
  2 1 0 –1 3 –1 –1 x –1   // (x, 0)
  2 1 0 –1 3 –1 x –1 –1   // (x, 1)
  2 1 0 –1 3 x –1 –1 –1   // (x, 3)
• Element list, where each element is (label, position of the node to which x is attached): (x, 0); (x, 1); (x, 3)
• A valid element x may be attached only to nodes that lie on the path from the root to the right-most leaf.
[Figure: prefix tree with nodes 0 to 3; x attached at positions 0, 1, and 3 is valid, while x attached off the right-most path is not a valid element.]
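The relation between a prefix, an element (x, i), and the resulting string can be sketched as follows. The helper names `rightmost_path` and `extend` are hypothetical, and the prefix is assumed to be given as a list without trailing –1 symbols, as on the slide.

```python
def rightmost_path(prefix):
    """Positions on the path from the root to the right-most leaf.

    prefix is a string encoding (list) without trailing -1 symbols.
    """
    stack, count = [], 0
    for sym in prefix:
        if sym == -1:
            stack.pop()
        else:
            stack.append(count)
            count += 1
    return stack

def extend(prefix, label, pos):
    """Full encoding of the prefix with `label` attached under node `pos`."""
    path = rightmost_path(prefix)
    if pos not in path:
        raise ValueError(f"position {pos} is not on the right-most path {path}")
    depth = path.index(pos)
    climb = len(path) - 1 - depth  # backtracks needed to reach node pos
    return prefix + [-1] * climb + [label] + [-1] * (depth + 1)

prefix = [2, 1, 0, -1, 3]
for pos in rightmost_path(prefix):
    print(pos, extend(prefix, "x", pos))
```

Run on the slide's prefix, the loop visits positions 0, 1, and 3 and reproduces the three strings listed above; position 2 raises `ValueError` because it is off the right-most path.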

11 Candidate Generation. Goal: given an equivalence class of k-subtrees, obtain candidate (k+1)-subtrees. Main idea: consider each pair of elements in the class for extension, including self-extension.
Theorem: Assume elements are kept sorted by node label as the primary key and position as the secondary key. Let P be a prefix class, and let (x, i) and (y, j) denote any two elements in the class. Px denotes the class representing the extension of P with element (x, i). Define (y, j) join (x, i) as follows:
Case I (i = j): 1) if P ≠ ∅, add (y, j) and (y, j+1) to Px; 2) if P = ∅, add (y, j) to Px.
Case II (i > j): add (y, j) to Px.
Case III (i < j): no new element is possible in this case.
Note: the theorem as stated has a mistake. The (y, j+1) element of Case I is the problem, as the example on the next slide shows.

12 Prefix: 1 2. Element list: (3, 1); (4, 0).
• Prefix = 1 2 3 (extension with (3, 1)): (3, 1) join (3, 1) gives (3, 1) and (3, 2); (4, 0) join (3, 1) gives (4, 0).
• Prefix = 1 2 –1 4 (extension with (4, 0)): (4, 0) join (4, 0) gives (4, 0) and (4, 2).
If we add (y, j+1), i.e., (4, 1), we get the following tree: 1 2 4 –1 4, wrong!
[Figure: the candidate trees produced by each join.]
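The join can be sketched in Python so that it reproduces the element lists of this example. Note the assumption: instead of (y, j+1), the second Case I element is attached to the newly added node x, whose position in Px equals the number of nodes in P. That reading is inferred from the (3, 2) and (4, 2) elements shown above; it is not a quotation of the paper. For an empty prefix the sketch emits only (y, 0), which is likewise an assumption. The names `join` and `extend_class` are hypothetical.

```python
def join(prefix_size, x_elem, y_elem):
    """Elements contributed to class [Px] by (y, j) join (x, i).

    prefix_size is the number of nodes in P. Assumption: x is appended last,
    so its position in Px is prefix_size, and the "child of x" candidate is
    (y, prefix_size) rather than (y, j + 1).
    """
    (_x, i), (y, j) = x_elem, y_elem
    x_pos = prefix_size
    if i == j:                       # Case I
        if prefix_size == 0:
            return [(y, x_pos)]      # y can only hang below the root x
        return [(y, j), (y, x_pos)]  # sibling of x, or child of x
    if i > j:                        # Case II
        return [(y, j)]
    return []                        # Case III: no candidate

def extend_class(prefix_size, elements):
    """Map each element (x, i) to the element list of its class [Px]."""
    return {
        xe: [e for ye in elements for e in join(prefix_size, xe, ye)]
        for xe in elements
    }

# Prefix "1 2" has two nodes; its element list is (3, 1), (4, 0).
for x_elem, new_elems in extend_class(2, [(3, 1), (4, 0)]).items():
    print(x_elem, "->", new_elems)
```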

13 TreeMiner Algorithm
TreeMiner(D, minsup), where D is the database of trees (the forest):
  F1 = { frequent 1-subtrees };
  F2 = { classes [P]1 of frequent 2-subtrees };
  for all [P]1, do Enumerate-Frequent-Subtrees([P]1);
Enumerate-Frequent-Subtrees([P]):
  for each element (x, i) ∈ [P] do
    for each element (y, j) ∈ [P] do
      (y, j) join (x, i) yields at most two new candidate subtrees;
      for each candidate subtree, do the scope-list join;
      if it is frequent, add the subtree to the list of frequent subtrees;
  repeat until all frequent subtrees have been enumerated.
Notation: P is a prefix class. [P]1 means the prefix size is 1, i.e., there is only one node in the prefix. Px refers to the new prefix tree formed by adding (x, i) to P. Fk is the set of all frequent subtrees of size k.

14 An Example of the TreeMiner Algorithm. Database D of 3 trees: T0, T1, T2. [Figure: trees T0, T1, and T2 with their node numbers.]
D in horizontal format, (tree-id, string encoding):
(T0, 1 2 –1 3 4 –1 –1)
(T1, 2 1 2 –1 4 –1 –1 2 –1 3 –1)
(T2, 1 3 2 –1 –1 5 1 2 –1 3 4 –1 –1 –1 –1)
D in vertical format, one list of (tree-id, scope) pairs per label:
Label 1: 0, [0,3]; 1, [1,3]; 2, [0,7]; 2, [4,7]
Label 2: 0, [1,1]; 1, [0,5]; 1, [2,2]; 1, [4,4]; 2, [2,2]; 2, [5,5]
Label 3: 0, [2,3]; 1, [5,5]; 2, [1,2]; 2, [6,7]
Label 4: 0, [3,3]; 1, [3,3]; 2, [7,7]
Label 5: 2, [3,7]
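The conversion from horizontal to vertical format can be sketched as below. The function names `vertical` and `support` and the dictionary layout are assumptions for illustration; the three encodings are the ones given on this slide.

```python
from collections import defaultdict

def vertical(db):
    """Build {label: [(tree_id, (lower, upper)), ...]} from {tree_id: encoding}."""
    lists = defaultdict(list)
    for tid, enc in db.items():
        labels, upper, stack = [], [], []
        for sym in enc:
            if sym == -1:
                # Closing a node: its scope ends at the last node seen so far.
                upper[stack.pop()] = len(labels) - 1
            else:
                stack.append(len(labels))
                labels.append(sym)
                upper.append(None)
        for pos in stack:  # the root is never closed by a -1
            upper[pos] = len(labels) - 1
        for pos, lab in enumerate(labels):
            lists[lab].append((tid, (pos, upper[pos])))
    return dict(lists)

def support(scope_list):
    """Number of distinct trees that appear in a scope list."""
    return len({tid for tid, _scope in scope_list})

D = {
    0: [1, 2, -1, 3, 4, -1, -1],
    1: [2, 1, 2, -1, 4, -1, -1, 2, -1, 3, -1],
    2: [1, 3, 2, -1, -1, 5, 1, 2, -1, 3, 4, -1, -1, -1, -1],
}
V = vertical(D)
for label in sorted(V):
    print(label, support(V[label]), V[label])
```

With minsup = 100% (all three trees), labels 1 to 4 have support 3 and label 5 has support 1, so only label 5 would be dropped from F1.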

15 Scope-List Joins Example: minsup = 100%
Step 1: Calculate F1. Prefix = {}, element list: (1,-1), (2,-1), (3,-1), (4,-1). Infrequent element: (5,-1). Each scope-list entry is (tree id, node scope); for example, in 0,[0,3] the tree id is 0 and the node scope is [0,3].
Label 1: 0,[0,3]; 1,[1,3]; 2,[0,7]; 2,[4,7]
Label 2: 0,[1,1]; 1,[0,5]; 1,[2,2]; 1,[4,4]; 2,[2,2]; 2,[5,5]
Label 3: 0,[2,3]; 1,[5,5]; 2,[1,2]; 2,[6,7]
Label 4: 0,[3,3]; 1,[3,3]; 2,[7,7]
Step 2: Calculate F2. Suppose prefix = {1}; element list: (2,0), (4,0). Infrequent elements: (1,0), (3,0). Each entry is (tree id, node number (position) of the prefix {1}, scope of the element node); for example, in 0,0,[1,1] the tree id is 0, the prefix node is at position 0, and the element scope is [1,1].
Element (2,0): 0,0,[1,1]; 1,1,[2,2]; 2,0,[2,2]; 2,0,[5,5]; 2,4,[5,5]
Element (4,0): 0,0,[3,3]; 1,1,[3,3]; 2,0,[7,7]; 2,4,[7,7]
Step 3: Calculate F3. Suppose prefix = {1,2}; element list: (4,0). Infrequent elements: (2,0), (2,1), (4,1). Each entry is (tree id, node numbers (positions) of the prefix {1 2}, scope of the element node); for example, in 0,01,[3,3] the tree id is 0, the prefix nodes are at positions 0 and 1, and the element scope is [3,3].
Element (4,0): 0,01,[3,3]; 1,12,[3,3]; 2,02,[7,7]; 2,05,[7,7]; 2,45,[7,7]
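Step 2 can be reproduced with a small in-scope join: an element node is paired with a prefix node from the same tree whenever the element's scope lies strictly inside the prefix node's scope. The sketch below covers only this F2 case (a 1-node prefix); the join needed for sibling elements in Step 3 is not shown. The names `in_scope_join`, `support`, and `SCOPE_LISTS` are hypothetical, and the data is copied from Step 1.

```python
from typing import Dict, List, Tuple

Scope = Tuple[int, int]
ScopeList = List[Tuple[int, Scope]]    # (tree id, scope)
Joined = List[Tuple[int, int, Scope]]  # (tree id, prefix position, element scope)

def in_scope_join(prefix_list: ScopeList, elem_list: ScopeList) -> Joined:
    """Pair each prefix node with the element nodes that are its descendants."""
    out = []
    for tid_p, (lo_p, up_p) in prefix_list:
        for tid_e, (lo_e, up_e) in elem_list:
            # Same tree, and the element scope is strictly inside the prefix scope.
            if tid_p == tid_e and lo_p < lo_e and up_e <= up_p:
                out.append((tid_e, lo_p, (lo_e, up_e)))
    return out

def support(joined: Joined) -> int:
    """Number of distinct trees with at least one occurrence."""
    return len({entry[0] for entry in joined})

# Scope lists of the frequent labels from Step 1.
SCOPE_LISTS: Dict[int, ScopeList] = {
    1: [(0, (0, 3)), (1, (1, 3)), (2, (0, 7)), (2, (4, 7))],
    2: [(0, (1, 1)), (1, (0, 5)), (1, (2, 2)), (1, (4, 4)), (2, (2, 2)), (2, (5, 5))],
    3: [(0, (2, 3)), (1, (5, 5)), (2, (1, 2)), (2, (6, 7))],
    4: [(0, (3, 3)), (1, (3, 3)), (2, (7, 7))],
}

MINSUP = 3  # 100% of the three trees
for y in sorted(SCOPE_LISTS):
    joined = in_scope_join(SCOPE_LISTS[1], SCOPE_LISTS[y])
    status = "frequent" if support(joined) >= MINSUP else "infrequent"
    print(f"(1 {y})", status, joined)
```

Under these assumptions the frequent elements for prefix {1} come out as labels 2 and 4, in line with the element list (2,0), (4,0) of Step 2.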

16 Conclusion
• Introduces the notion of mining embedded subtrees in a (forest) database of trees.
• Systematic candidate subtree generation: no subtree is generated more than once (but the theorem has a mistake).
• Uses a string encoding of trees to store the dataset efficiently.
• Uses a node's scope to develop scope-lists.
• Introduces a new algorithm, TreeMiner.

