Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA 90095 {

Presentation transcript:

Indexing and Mining Free Trees
Yun Chi, Yirong Yang, Richard R. Muntz
Department of Computer Science, University of California, Los Angeles, CA
2003 IEEE International Conference on Data Mining

Outline
Abstract
Introduction
Canonical Form for Labeled Free Trees
– Labeled, Rooted, Ordered Trees
– Labeled, Rooted, Unordered Trees
– Labeled Free Trees
– Normalizing Rooted Trees
– Converting to Canonical Strings
Mining Frequent Subtrees
Experimental Results
Conclusions

Abstract
Tree structures are used in:
– computational biology
– pattern recognition
– computer networks, and so on.
In this paper:
– Indexing free trees.
– Mining frequent subtrees.
Concepts:
– Canonical form.
– Canonical string.

Introduction
Free trees are connected, acyclic, undirected graphs.
Some real applications using free trees:
– Shape axis trees.
– Multicast trees in computer networking.
– Molecular evolution (phylogeny trees).

Shape axis tree
[Figure: shape contour, shape axis (SA), and SA-tree]

Multicast trees in computer networking:
Select a set of sequential transmissions that connect a source to a set of receivers so that the sum of the transmission energy costs is minimised.

Introduction
Example (frequent itemset mining problem):
Given the following group of transactions, which represent the items bought by customers, we can determine the support of some subsets.
– T1 {bread, milk, beer, diapers}
– T2 {beer, apples, diapers}
– T3 {diapers, milk, beer}
– T4 {beer, apples, diapers}
– T5 {milk, bread, chocolate}
The support of {beer} is 4/5 = 80%.
The support of {beer, diapers} is 4/5 = 80%.
The support of {beer, milk} is 2/5 = 40%.
Given a minimum support MinSup, an itemset X is frequent in T if Support(X) ≥ MinSup.
With a transaction set T and a MinSup, the problem of frequent itemset mining is to find the complete set of frequent itemsets in T.
For example, with a 25% minimum support:
– {chocolate} is not a frequent itemset, because its support is 1/5 = 20% < 25%.
– {beer, apples} is a frequent itemset, because its support is 2/5 = 40% ≥ 25%.
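
The support figures in this example can be checked with a few lines of Python. This is only a brute-force sketch for the five transactions on the slide, not an efficient mining algorithm; the function names `support` and `frequent_itemsets` are illustrative, and the threshold test is written as `>=`, matching the frequent-subtree definition later in the talk.

```python
from itertools import combinations

# The five transactions from the slide.
transactions = [
    {"bread", "milk", "beer", "diapers"},   # T1
    {"beer", "apples", "diapers"},          # T2
    {"diapers", "milk", "beer"},            # T3
    {"beer", "apples", "diapers"},          # T4
    {"milk", "bread", "chocolate"},         # T5
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def frequent_itemsets(transactions, min_sup):
    """Enumerate all itemsets and keep those with support >= min_sup.

    Exhaustive enumeration is exponential in the number of items;
    it is used here only because the example is tiny.
    """
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_sup:
                result[frozenset(candidate)] = s
    return result

frequent = frequent_itemsets(transactions, 0.25)
print(support({"beer", "diapers"}, transactions))
print(frozenset({"beer", "apples"}) in frequent)
print(frozenset({"chocolate"}) in frequent)
```

With a 25% threshold, `{beer, apples}` appears in the result and `{chocolate}` does not, as stated on the slide.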

Introduction
Trees in applications are often labeled:
– Labels are attached to vertices and edges.
In applications, two problems are important from the database point of view:
– How to index trees?
– How to efficiently discover interesting patterns?
One type of interesting pattern consists of those patterns that are embedded in many transactions in a database.

Canonical Form for Labeled Free Trees
A rooted tree is a tree in which one vertex is singled out.
Assume that:
– Trees are rooted.
– All edge labels are identical.
Each edge connects a vertex with its parent, so an edge, together with its label, can be considered part of the child vertex.

Canonical Form for Labeled Free Trees
Definition 1: Canonical form
– For labeled rooted trees with height 0 (i.e., trees consisting of a single vertex), the canonical forms are the vertices themselves, and the order among such trees is defined by the order of the vertex labels.
– For a labeled rooted tree with height h where h > 0, the canonical form is obtained by first normalizing all subtrees of the root, then rearranging the subtrees in increasing order (from left to right in the illustrating examples).
– For a pair of labeled rooted trees (in their canonical forms) with heights less than or equal to h where h > 0, their order is defined by first comparing the labels of their roots, then comparing their corresponding subtrees from left to right until their relative order is resolved.
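
The recursive definition can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a tree is a hypothetical `(label, children)` pair, labels are strings, and the order between trees is borrowed from Python's built-in tuple comparison (root label first, then subtrees left to right, with a shorter child list that is a prefix of a longer one counting as smaller). That tie-breaking rule is an assumption; the slide does not spell it out.

```python
def normalize(tree):
    """Return the canonical form of a labeled rooted unordered tree.

    `tree` is a (label, children) pair, where children is any iterable
    of trees. Subtrees are normalized first and then sorted in
    increasing order, following Definition 1.
    """
    label, children = tree
    return (label, tuple(sorted(normalize(child) for child in children)))

# Two drawings of the same unordered tree, with children listed
# in different orders.
t1 = ("A", [("C", []), ("B", [("E", []), ("D", [])])])
t2 = ("A", [("B", [("D", []), ("E", [])]), ("C", [])])

print(normalize(t1) == normalize(t2))
```

Because both inputs normalize to the same nested tuple, the canonical form can serve as a lookup key when indexing trees.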

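Canonical Form for Labeled Free Trees
When edge labels are considered, the edge connecting a child vertex to its parent is treated as part of the child, so branches are compared by their (edge label, vertex label) pairs. For example, branch "2,D" is less than branch "3,C".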

Canonical Form for Labeled Free Trees
The running time for the normalization is O(c · k log k), where c is the maximal fan-out of the tree and k is the number of vertices in the tree.

Canonical Form for Labeled Free Trees
Labeled Free Trees

Canonical Form for Labeled Free Trees
Converting to Canonical Strings
– "$" represents a backtrack.
– "#" represents the end of the string.
– All edges are assumed to have label "1".
Two ways to define a canonical string:
– Depth-first tree traversal: G1F1D1B$1B$1C$$1E1A$$$1F1E1A$$$1F#
– Breadth-first tree traversal: G$1F1F1F$1D1E$1E$$1B1B1C$1A$1A#
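
Both encodings can be reproduced with a short Python sketch. The tree below is reconstructed from the two strings on the slide (the original figure is not part of the transcript), and the `(label, children)` representation and the function names are illustrative assumptions. Trailing backtrack symbols are dropped before the final "#", which is what the slide's strings suggest.

```python
from collections import deque

def dfs_string(tree, edge_label="1"):
    """Depth-first encoding: '$' marks a backtrack, '#' ends the string."""
    def visit(node):
        label, children = node
        out = [label]
        for child in children:
            out.append(edge_label)    # edge from this vertex to the child
            out.extend(visit(child))
            out.append("$")           # return from the child
        return out
    # Trailing backtracks carry no information, so they are removed.
    return "".join(visit(tree)).rstrip("$") + "#"

def bfs_string(tree, edge_label="1"):
    """Breadth-first encoding: '$' separates the child lists of
    successive vertices, '#' ends the string."""
    out = [tree[0]]
    queue = deque([tree])
    while queue:
        _, children = queue.popleft()
        out.append("$")
        for child in children:
            out.append(edge_label + child[0])
            queue.append(child)
    return "".join(out).rstrip("$") + "#"

def leaf(label):
    return (label, [])

# Tree reconstructed from the strings on the slide.
tree = ("G", [
    ("F", [("D", [leaf("B"), leaf("B"), leaf("C")]), ("E", [leaf("A")])]),
    ("F", [("E", [leaf("A")])]),
    ("F", []),
])

print(dfs_string(tree))
print(bfs_string(tree))
```

The sketch assumes that labels never contain "$" or "#"; otherwise the separators would be ambiguous.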

Mining Frequent Subtrees
Frequent subtree mining problem:
– Let D denote a database where each transaction t ∈ D is a labeled free tree.
– For a given pattern s (which is a free tree), we say s occurs in a transaction t (or t supports s) if there exists a subtree of t that is isomorphic to s.
– The support of a pattern s is the fraction of transactions in database D that support s.
– A pattern s is called frequent if its support is greater than or equal to a minimum support (minsup) specified by a user.
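
The same definitions can be written compactly as a formula. The notation support(·) is introduced here only for illustration and does not appear on the slide.

```latex
\mathrm{support}(s) \;=\; \frac{\bigl|\{\, t \in D : s \text{ occurs in } t \,\}\bigr|}{|D|},
\qquad
s \text{ is frequent} \iff \mathrm{support}(s) \ge \mathit{minsup}.
```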

Mining Frequent Subtrees

Experimental Results
The performance of the FreeTreeMiner algorithm was evaluated on:
– a group of synthetic datasets,
– a chemical compounds dataset,
– a multicast trees dataset.
The main results:
– The running time of the FreeTreeMiner algorithm scales linearly with the number of transactions in a database.
– The running time scales with the size of the frequent trees in a nonlinear fashion because of the subtree isomorphism checking algorithm.
– The number of intermediate frequent subtrees increases exponentially with the size of the maximal frequent subtree.

Conclusions
A novel indexing technique for databases of labeled free trees:
– Based on a unique representation, the canonical form.
– Canonical form => canonical string.
The paper defines the frequent subtree mining problem and presents an efficient algorithm.
Synthetic and real application datasets are used to study the performance of the algorithm.
– Full version available as Technical Report CSD-TR No at ftp://ftp.cs.ucla.edu/tech-report/2003-reports/ pdf.

Thank you