Indexing and Mining Free Trees
Yun Chi, Yirong Yang, Richard R. Muntz
Department of Computer Science, University of California, Los Angeles, CA 90095
{ychi,yyr,muntz}@cs.ucla.edu
2003 IEEE International Conference on Data Mining

Outline
Abstract
Introduction
Canonical Form for Labeled Free Trees
–Labeled, Rooted, Ordered Trees
–Labeled, Rooted, Unordered Trees
–Labeled Free Trees
–Normalizing Rooted Trees
–Converting to Canonical Strings
Mining Frequent Subtrees
Experimental Results
Conclusions

Abstract
Tree structures are used in:
–computational biology
–pattern recognition
–computer networks, and so on.
In this paper:
–Indexing free trees.
–Mining frequent subtrees.
Concepts:
–Canonical form.
–Canonical string.

Introduction
Free trees: connected, acyclic, undirected graphs.
Some real applications using free trees:
–Shape axis trees.
–Multicast trees in computer networking.
–Molecular evolution (phylogeny trees).

Shape axis tree (figure: shape contour, shape axis (SA), and SA-tree)

Multicast trees in computer networking: select a set of sequential transmissions that connects a source to a set of receivers so that the sum of the transmission energy costs is minimised.

Introduction
Example (frequent itemset mining problem):
Given the following group of transactions, which represent the items bought by customers, we can determine the support of some subsets.
–T1 {bread, milk, beer, diapers}
–T2 {beer, apples, diapers}
–T3 {diapers, milk, beer}
–T4 {beer, apples, diapers}
–T5 {milk, bread, chocolate}
The support of {beer} is 4/5 = 80%.
The support of {beer, diapers} is 4/5 = 80%.
The support of {beer, milk} is 2/5 = 40%.
Given a minimum support MinSup, an itemset X is frequent in T if Support(X) > MinSup.
With a transaction set T and a MinSup, the problem of frequent itemset mining is to find the complete set of frequent itemsets in T.
For example, with a 25% minimum support:
–{chocolate} is not a frequent itemset because its support is 1/5 = 20% < 25%.
–{beer, apples} is a frequent itemset because its support is 2/5 = 40% > 25%.
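
The support figures in this example can be checked with a few lines of Python. This is only a sketch: the names `transactions`, `support`, and `is_frequent` are illustrative and do not come from the slides, and the strict `>` follows the definition given on this slide.

```python
# The five transactions from the slide, one set of items per customer.
transactions = [
    {"bread", "milk", "beer", "diapers"},   # T1
    {"beer", "apples", "diapers"},          # T2
    {"diapers", "milk", "beer"},            # T3
    {"beer", "apples", "diapers"},          # T4
    {"milk", "bread", "chocolate"},         # T5
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def is_frequent(itemset, transactions, min_sup):
    """Frequent means Support(X) > MinSup, as stated on the slide."""
    return support(itemset, transactions) > min_sup


for items in ({"beer"}, {"beer", "diapers"}, {"beer", "milk"}):
    print(sorted(items), support(items, transactions))
```

With `min_sup = 0.25`, `is_frequent({"chocolate"}, ...)` is false and `is_frequent({"beer", "apples"}, ...)` is true, matching the two cases on the slide.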

Introduction
Trees in applications are often labeled:
–Labels are attached to vertices and edges.
In applications, two problems are important from the database point of view:
–How to index trees?
–How to efficiently discover interesting patterns?
One type of interesting pattern consists of those patterns that are embedded in many transactions in a database.

Canonical Form for Labeled Free Trees
A rooted tree is a tree in which one vertex is singled out as the root.
Assume that:
–Trees are rooted.
–All edge labels are identical.
Each edge connects a vertex with its parent, so an edge, together with its label, can be considered part of the child vertex.

Canonical Form for Labeled Free Trees
Definition 1: Canonical form
–For labeled rooted trees with height 0 (i.e., trees consisting of a single vertex), the canonical forms are the vertices themselves, and the order among such trees is defined by the order of the vertex labels.
–For a labeled rooted tree with height h where h > 0, the canonical form is obtained by first normalizing all subtrees of the root and then rearranging the subtrees in increasing order (from left to right in the illustrating examples).
–For a pair of labeled rooted trees (in their canonical forms) with heights less than or equal to h where h > 0, their order is defined by first comparing the labels of their roots and then comparing their corresponding subtrees from left to right until their relative order is resolved.
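
Definition 1 can be read as a recursive procedure, sketched below in Python. This is not the authors' implementation. It assumes a tree is written as a `(label, children)` pair, that all edge labels are identical (so they are ignored), and that when two trees agree on every compared subtree, the one that runs out of subtrees first sorts later. That last tie-break is not stated in the definition; it is an assumption chosen to agree with the canonical strings shown later in the deck. The names `compare` and `normalize` are illustrative.

```python
from functools import cmp_to_key


def compare(a, b):
    """Order two canonical trees: root labels first, then subtrees left to right."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    if label_a != label_b:
        return -1 if label_a < label_b else 1
    for x, y in zip(kids_a, kids_b):
        result = compare(x, y)
        if result:
            return result
    # Assumption (not in Definition 1): the tree with fewer subtrees sorts later.
    return (len(kids_b) > len(kids_a)) - (len(kids_a) > len(kids_b))


def normalize(tree):
    """Return the canonical form of a rooted, unordered, vertex-labeled tree."""
    label, children = tree
    # Normalize every subtree first, then arrange the subtrees in increasing order.
    kids = sorted((normalize(c) for c in children), key=cmp_to_key(compare))
    return (label, tuple(kids))


# Two drawings of the same unordered tree normalize to the same canonical form.
t1 = ("A", [("C", []), ("B", [("D", [])])])
t2 = ("A", [("B", [("D", [])]), ("C", [])])
print(normalize(t1) == normalize(t2))
```

Because the result is built from nested tuples, canonical forms can be compared with `==` or used as dictionary keys when indexing a collection of trees.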

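Canonical Form for Labeled Free Trees
When edge labels are considered, the edge connecting a child vertex to its parent is treated as part of the child, so branches are compared on both the edge label and the vertex label. For example, branch “2,D” is less than branch “3,C”.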

Canonical Form for Labeled Free Trees
The running time for the normalization is O(c · k log k), where c is the maximal fanout of the tree and k is the number of vertices in the tree.

Canonical Form for Labeled Free Trees
Labeled Free Trees (figure)

Canonical Form for Labeled Free Trees
Converting to Canonical Strings
–“$” represents a backtrack.
–“#” represents the end of the string.
–All edges are assumed to have label “1”.
Two ways to define a canonical string:
–Depth-first tree traversal: G1F1D1B$1B$1C$$1E1A$$$1F1E1A$$$1F#
–Breadth-first tree traversal: G$1F1F1F$1D1E$1E$$1B1B1C$1A$1A#
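
The two encodings can be reproduced with the short Python sketch below. The example tree is an assumption: the slide's figure is not available here, so `TREE` was decoded from the depth-first string itself. The way trailing backtracks are dropped before the closing “#”, and the use of “$” as a separator between sibling groups in the breadth-first string, are likewise inferred from the two strings rather than stated on the slide. The names `dfs_string` and `bfs_string` are illustrative.

```python
from collections import deque

# Tree as (label, [children]), already in canonical order.
# Assumption: decoded from the slide's depth-first string.
TREE = ("G", [
    ("F", [("D", [("B", []), ("B", []), ("C", [])]), ("E", [("A", [])])]),
    ("F", [("E", [("A", [])])]),
    ("F", []),
])

EDGE = "1"  # every edge carries the same label, as on the slide


def dfs_string(tree):
    """Depth-first encoding: '$' marks a backtrack, '#' ends the string."""
    parts = []

    def visit(node):
        label, children = node
        parts.append(label)
        for child in children:
            parts.append(EDGE)
            visit(child)
            parts.append("$")  # step back up to the parent

    visit(tree)
    # Trailing backtracks are dropped and replaced by the end marker.
    return "".join(parts).rstrip("$") + "#"


def bfs_string(tree):
    """Breadth-first encoding: '$' separates the child groups of successive vertices."""
    parts = [tree[0]]
    queue = deque([tree])
    while queue:
        _, children = queue.popleft()
        parts.append("$")
        for child in children:
            parts.append(EDGE + child[0])
            queue.append(child)
    return "".join(parts).rstrip("$") + "#"


print(dfs_string(TREE))
print(bfs_string(TREE))
```

Run on `TREE`, both functions return the strings shown on the slide, which suggests the decoded tree and the trimming rule are consistent with the original figure.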

Mining Frequent Subtrees
Frequent subtree mining problem:
–Let D denote a database where each transaction t ∈ D is a labeled free tree.
–For a given pattern s (which is a free tree), we say s occurs in a transaction t (or t supports s) if there exists a subtree of t that is isomorphic to s.
–The support of a pattern s is the fraction of transactions in database D that support s.
–A pattern s is called frequent if its support is greater than or equal to a minimum support (minsup) specified by a user.
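
The support definition can be illustrated with a deliberately small Python sketch. It is not FreeTreeMiner: general subtree isomorphism is beyond a short example, so the occurrence test here only handles patterns made of a single edge (two labeled vertices), where "a subtree of t isomorphic to s" reduces to "t has an edge with the same pair of endpoint labels". The data layout (a dict of vertex labels plus an edge list), the toy database, and all function names are assumptions made for illustration.

```python
def edge_pattern_occurs(pattern, tree):
    """True if the two-vertex pattern (label_u, label_v) matches some edge of `tree`."""
    labels, edges = tree
    target = sorted(pattern)  # free trees are undirected, so endpoint order is irrelevant
    return any(sorted((labels[u], labels[v])) == target for u, v in edges)


def support(pattern, database, occurs=edge_pattern_occurs):
    """Fraction of transactions in `database` that support `pattern`."""
    return sum(occurs(pattern, t) for t in database) / len(database)


def is_frequent(pattern, database, minsup, occurs=edge_pattern_occurs):
    """Frequent means support >= minsup, as in the definition above."""
    return support(pattern, database, occurs) >= minsup


# Hypothetical database of four small labeled free trees.
database = [
    ({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),   # path A-B-C
    ({0: "A", 1: "B"}, [(0, 1)]),                   # single edge A-B
    ({0: "B", 1: "C", 2: "C"}, [(0, 1), (0, 2)]),   # star centred on B
    ({0: "A", 1: "C"}, [(0, 1)]),                   # single edge A-C
]

print(support(("A", "B"), database))
print(is_frequent(("A", "C"), database, minsup=0.5))
```

The `occurs` parameter is where a full subtree isomorphism test would be plugged in for larger patterns.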

Mining Frequent Subtrees

Experimental Results
The performance of the FreeTreeMiner algorithm was evaluated on:
–a group of synthetic datasets,
–a chemical compounds dataset,
–a multicast trees dataset.
The main results:
–The running time of the FreeTreeMiner algorithm scales linearly with the number of transactions in a database.
–The running time scales with the size of the frequent trees in a nonlinear fashion because of the subtree isomorphism checking algorithm.
–The number of intermediate frequent subtrees increases exponentially with the size of the maximal frequent subtree.

Conclusions
A novel indexing technique for databases of labeled free trees.
–Based on a unique representation, the canonical form.
–Canonical form => canonical string.
The paper defined the frequent subtree mining problem and presented an efficient algorithm.
Synthetic and real application datasets were used to study the performance of the algorithm.
–Full version available as Technical Report CSD-TR No. 030041 at ftp://ftp.cs.ucla.edu/tech-report/2003-reports/030041.pdf.
