Graph Mining Laks V.S. Lakshmanan


Graph Mining Laks V.S. Lakshmanan Based on Xifeng Yan and Jiawei Han: gSpan: Graph-Based Substructure Pattern Mining. ICDM 2002. Also see their technical report (UIUC CS web site).

What’s the relevance to this course? Specifically, what does graph mining (GM) have to do with influence analysis? The connection is technical: we need to build up some background in GM to appreciate it. We will turn to influence analysis after covering GM.

The problem Given a database of graphs*, find subgraphs with support >= minSup (a user-specified threshold). *Graphs may be labeled, weighted, etc.; we will consider a simple version for clarity. There is another problem formulation: given one large graph, find frequent pattern subgraphs.

Basics Given a database D = {G0, ..., Gn} and a pattern graph g, support_D(g) = number of graphs in D that contain g as a subgraph. We want to find all g with support(g) >= minSup, and are mainly interested in frequent connected subgraphs. (Why?) Analogous to: given a database of market basket transactions, find itemsets S with support(S) >= minSup – used for association rules and correlations. Quick review of AR mining next.
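As a concrete (if naive) rendering of this definition, here is a minimal sketch of the support count using the networkx library. The function names and the label-matching convention are my own, not from the slides; the per-graph subgraph test is exactly the expensive step that gSpan is designed to tame.

```python
# Sketch only: support of a pattern graph over a graph database, using
# networkx's subgraph-monomorphism test (non-induced containment, the
# notion used in frequent subgraph mining). Naive: one NP-hard test per
# graph in D.
import networkx as nx
from networkx.algorithms import isomorphism

def support(D, g):
    """Number of graphs in D that contain g as a subgraph."""
    count = 0
    for G in D:
        matcher = isomorphism.GraphMatcher(
            G, g,
            node_match=lambda a, b: a.get("label") == b.get("label"),
            edge_match=lambda a, b: a.get("label") == b.get("label"))
        if matcher.subgraph_is_monomorphic():
            count += 1
    return count

def is_frequent(D, g, minSup):
    return support(D, g) >= minSup
```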

Review of Association Rule Mining 1/2 I = universe of items. t = transaction (a set of items). D = database (a set of transactions). For a set of items S, support(S) = |{t ∈ D : S ⊆ t}|. Given minSup, S is frequent if support(S) >= minSup. For itemsets X, Y: conf(X→Y) = support(X ∪ Y)/support(X). Given D, find all rules X→Y such that X ∪ Y is frequent and conf(X→Y) >= minConf.
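In code these definitions are one-liners; the following sketch (my own variable names, with transactions as frozensets) mirrors them directly:

```python
# Illustrative sketch of support and confidence; D is a list of frozensets.
def support(D, S):
    """support(S) = |{t in D : S subset of t}|"""
    S = frozenset(S)
    return sum(1 for t in D if S <= t)

def confidence(D, X, Y):
    """conf(X -> Y) = support(X u Y) / support(X)"""
    return support(D, frozenset(X) | frozenset(Y)) / support(D, X)

D = [frozenset("abc"), frozenset("ac"), frozenset("bd")]
assert support(D, "ac") == 2
assert confidence(D, "a", "c") == 1.0   # both 'a'-transactions contain 'c'
```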

Review of AR Mining 2/2 Two major steps: 1. Find all frequent itemsets. 2. Find all ARs with confidence >= minConf. By far, step 1 is the expensive one. Numerous algorithms exist; we will sketch two major ones: 1. Apriori (the classic). 2. FP-trees (more recent and more efficient).

Itemset Lattice (bottom-up):
{}
{A} {B} {C}
{A,B} {A,C} {B,C}
{A,B,C}
Just conceptual: the real itemset lattice is over tens of thousands of items!

Apriori Finding frequent itemsets: Sweep the itemset lattice bottom-up. Generate candidate itemsets whose frequency must be checked, using the Apriori principle: if X is infrequent, none of its supersets can be frequent [support is anti-monotone; the frequent itemsets are downward closed]. Check the frequency of the candidate itemsets and keep the frequent ones. Repeat until the candidate set is empty. A sketch of this loop follows.
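Here is a compact (and deliberately unoptimized) sketch of the level-wise loop; all names are my own, and support counting is done by rescanning D rather than with the hash-tree machinery of real implementations.

```python
from itertools import combinations

def apriori(D, minSup):
    """D: list of frozensets. Returns all frequent itemsets (as frozensets)."""
    items = {i for t in D for i in t}
    freq = {frozenset([i]) for i in items
            if sum(1 for t in D if i in t) >= minSup}
    all_freq, k = set(freq), 1
    while freq:
        # Join: merge frequent k-itemsets sharing k-1 items.
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # Prune: drop candidates with an infrequent k-subset (anti-monotonicity).
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k))}
        # Count: one scan of D per level, then filter.
        freq = {c for c in cands if sum(1 for t in D if c <= t) >= minSup}
        all_freq |= freq
        k += 1
    return all_freq
```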

Apriori (contd.) Generating rules: for each frequent itemset S and every partition of S into X and Y, compute conf(X→Y) and output the rule if conf >= minConf. Candidate generation: let Ck and C’k be any two frequent itemsets of size k s.t. |Ck ∩ C’k| = k-1. (Easy to pick such pairs if we keep the itemsets sorted.) Then Ck ∪ C’k is a viable candidate set. Details: Agrawal & Srikant, VLDB 1994. A sketch of the rule-generation step follows.
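The rule-generation step, again as a rough sketch with my own names; `supp` is assumed to be a dict of supports collected while mining:

```python
from itertools import combinations

def gen_rules(all_freq, supp, minConf):
    """all_freq: iterable of frozensets; supp: dict frozenset -> support."""
    rules = []
    for S in all_freq:
        if len(S) < 2:
            continue
        for r in range(1, len(S)):                 # every non-trivial split
            for X in map(frozenset, combinations(S, r)):
                conf = supp[S] / supp[X]           # conf(X -> S \ X)
                if conf >= minConf:
                    rules.append((X, S - X, conf))
    return rules
```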

FP-trees Apriori is quite efficient, but has an expensive candidate-itemset generation step built in, and not all candidates turn out to be frequent. Can we cut down on, or even completely avoid, candidate generation? Answer: “Yes, using FP-trees!” [Han, Pei & Yin. Mining Frequent Patterns without Candidate Generation. SIGMOD 2000.]

Key strengths of the FP-tree based approach Compress the database into a compact structure (a tree plus links) which may often fit in memory. Make each frequent (single) item a node and use prefix sharing to factor transactions; ordering items by descending frequency maximizes sharing, since frequent items are more likely to appear together. Never need to look at the original DB again: two scans of the DB build the tree, and mining the tree then yields all frequent patterns w/o candidate generation.

Example DB from the paper:

TID  Items
1    f,a,c,d,g,i,m,p
2    a,b,c,f,l,m,o
3    b,f,h,j,o
4    b,c,k,s,p
5    a,f,c,e,l,p,m,n

Assume minSup = 3. Scan 1: find all frequent items: f:4, c:4, a:3, b:3, m:3, p:3 (ordered in descending frequency). Scan 2: read in every transaction (restricted to frequent items, in that order) and either insert new nodes or increment existing node counts, using prefix sharing. Prefix sharing achieves great compression. A sketch of Scan 1 follows.
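Scan 1 and the reordering of transactions are easy to make concrete; a small sketch (my own code, not the paper's) on the example DB:

```python
from collections import Counter

DB = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
minSup = 3

# Scan 1: count item frequencies over the whole DB.
counts = Counter(i for t in DB for i in t)
# Frequent items in descending frequency; ties are broken arbitrarily here,
# while the slides/paper use the order f, c, a, b, m, p.
order = [i for i, n in counts.most_common() if n >= minSup]

# Restrict each transaction to frequent items, in descending-frequency order.
ordered = [[i for i in order if i in t] for t in DB]
# With the paper's order, transaction 1 becomes ['f', 'c', 'a', 'm', 'p'].
```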

Example (contd.) [Figure: the FP-tree for the example DB, with a header table linking all nodes carrying the same item. Under the root, the tree consists of the paths (f:4, c:3, a:3, m:2, p:2), (f:4, c:3, a:3, b:1, m:1), (f:4, b:1), and (c:1, b:1, p:1).]

FP-tree construction Simple algorithm: as you read each transaction (keeping only frequent items, ordered in descending frequency), reflect each item into the current tree by either: incrementing the count of the (prefix-)matching node, OR starting a new branch. A sketch of the insertion step follows.
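A minimal sketch of the insertion step (my own rendering; the header table here maps each item to the list of its nodes, so mining can later follow the links):

```python
class Node:
    """One FP-tree node: an item, a count, a parent pointer, and children."""
    def __init__(self, item, parent):
        self.item, self.count, self.parent = item, 1, parent
        self.children = {}

def insert(root, transaction, header):
    """transaction: frequent items only, in descending frequency order."""
    node = root
    for item in transaction:
        if item in node.children:            # prefix matches: bump the count
            node = node.children[item]
            node.count += 1
        else:                                # no match: start a new branch
            child = Node(item, node)
            node.children[item] = child
            header.setdefault(item, []).append(child)
            node = child
```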

Frequent Pattern Generation Start with the header table, in ascending frequency order. Item p: freq(p) = 3 (just follow the link path from the header table entry for p, summing node counts). What about patterns containing p? Consider the p-projected DB: {(fcam: 2), (cb: 1)}. c:3 is the only frequent item here, so the p-conditional FP-tree is the single path (c:3), yielding the frequent patterns p:3 and cp:3.

FP Generation (contd.) Item m: m is frequent w/ freq(m) = 3. m-projected DB = {(fca:2), (fcab:1)}. Resulting m-conditional FP-tree = “(fca:3)”. The call mine(fca, m) generates recursive calls mine(fc, am), mine(f, cm), and mine({}, fm). These calls generate the FPs m:3, am:3, cm:3, fm:3, cam:3, fam:3, fcm:3, fcam:3, i.e., all FPs containing m. Other items are processed similarly. A sketch of this recursion follows.
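The recursion is short enough to sketch end to end. This toy version (my own code, not the paper's) represents conditional pattern bases directly as weighted prefix lists instead of materializing conditional FP-trees, which preserves the logic if not the efficiency:

```python
def fp_growth(patterns, suffix, cond_db, minSup):
    """cond_db: list of (prefix_tuple, count) pairs; fills `patterns` in place."""
    counts = {}
    for prefix, c in cond_db:
        for item in prefix:
            counts[item] = counts.get(item, 0) + c
    for item, n in counts.items():
        if n < minSup:
            continue
        new_suffix = (item,) + suffix
        patterns[new_suffix] = n
        # item-projected DB: keep only what precedes `item` in each prefix.
        proj = [(prefix[:prefix.index(item)], c)
                for prefix, c in cond_db if item in prefix]
        fp_growth(patterns, new_suffix, proj, minSup)

# Reproduces the m example: seed the call with the m-projected DB.
patterns = {('m',): 3}
fp_growth(patterns, ('m',), [(('f','c','a'), 2), (('f','c','a','b'), 1)], 3)
# patterns now also contains ('a','m'), ('c','m'), ('f','m'), ('c','a','m'),
# ('f','a','m'), ('f','c','m'), and ('f','c','a','m'), all with count 3.
```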

FP-trees: concluding remarks Note the avoidance of the expensive candidate generation step. Read both the SIGMOD 2000 paper and the journal version (DMKD 2004) for more details, including ideas for parallelizing via divide and conquer. Practical note: OTOH, Apriori implementations remain among the most highly optimized codes for pattern mining.

Additional Topics on Frequent Patterns ARs with constraints: e.g., X→Y such that SUM(X.price) <= c and AVG(Y.price) >= c’. (See Ng et al. SIGMOD 98 and Lakshmanan et al. SIGMOD 99 and numerous follow-ups.) Leung et al. 20?? push constraints into the FP-tree algorithm. Closed patterns: background and Galois lattice theory (see Zaki et al. and Pasquier et al.). Maximal patterns and patterns with the highest frequency.

Graph Mining Resumed Early work on GM followed in the footsteps of Apriori: Generate candidate subgraphs of size (k+1) from size-k (frequent) subgraphs (expensive). Eliminate infrequent candidates with a subgraph isomorphism test (NP-complete). gSpan replaces both with a simple depth-first search (no false positives to prune!). Key idea: the DFS code. Recall: our main interest is in frequent connected subgraphs.

gSpan Framework G = (V, E, L, l) – node- and edge-labeled, undirected. Only interested in connected frequent subgraphs. Frequent(g) ⇔ support(g) >= minSup. Recall the notions of isomorphism, subgraph isomorphism, and automorphism.

DFS and Pruning See the tech. report, Fig. 1, p5. How does it relate to the itemset lattice of frequent itemset mining? DFS tree (GT): see TR, Fig. 2, p6. Clearly many DFS trees are possible for a graph. Root, right-most node, right-most path. Forward and backward edges. Partial order on fwd & bwd edges.

DFS & Pruning (contd.) 1: PO on fwd edges: (i1,j1) < (i2,j2) iff j1 < j2: e.g., (v0,v1) < (v2,v3) in Fig. 2b. 2: PO on bwd edges: (i1,j1) < (i2,j2) iff either (a) i1 < i2 OR (b) i1 = i2 & j1 < j2: e.g., (v2,v0) < (v3,v1) [Fig. 2b] and (v2,v0) < (v3,v0) [Fig. 2c-d].

DFS & Pruning (contd.) 3: PO b/w fwd and bwd edges: e1 = (i1,j1) < e2 = (i2,j2) iff: e1 is fwd and e2 is bwd and j1 <= i2, OR e1 is bwd and e2 is fwd and i1 < j2. Theorem: rules 1-3 define a total order. Proof: exercise. Exercise: prove that the above TO is equivalent to: “< is the transitive closure of: (i) e1 < e2 if i1 = i2 & j1 < j2; (ii) e1 < e2 if i1 < j1 & j1 = i2.” A comparator for rules 1-3 is sketched below.
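To make rules 1-3 concrete, here is a small comparator sketch (my own encoding: an edge is a pair (i, j) of DFS discovery times, forward iff i < j, backward iff i > j):

```python
def edge_less(e1, e2):
    """True iff e1 < e2 under rules 1-3 of the DFS edge order."""
    (i1, j1), (i2, j2) = e1, e2
    fwd1, fwd2 = i1 < j1, i2 < j2
    if fwd1 and fwd2:                              # rule 1: both forward
        return j1 < j2
    if not fwd1 and not fwd2:                      # rule 2: both backward
        return i1 < i2 or (i1 == i2 and j1 < j2)
    if fwd1:                                       # rule 3: fwd vs. bwd
        return j1 <= i2
    return i1 < j2                                 # rule 3: bwd vs. fwd

assert edge_less((0, 1), (2, 3))   # fwd < fwd, as in Fig. 2b
assert edge_less((2, 0), (3, 1))   # bwd < bwd, as in Fig. 2b
```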

DFS & Pruning (contd.) DFS Code: the unique ordering of edges determined by a DFS tree (e.g., Fig. 2b-d). Procedure: pick a start node (root); repeat { add the fwd edge connecting the current “code” to a new node; add all bwd edges connecting the new node to old nodes } until no edges are left. E.g., Fig. 2.

DFS & Pruning (contd.) But a given graph has numerous DFS codes, making it complex to check whether a pattern g is isomorphic to a subgraph of some graph Gi ∈ D. Solution: in one pass, gather counts of node labels and edge labels and order them in non-ascending frequency order. We will assume frequency order = alphabetical order, for clarity.

DFS & Pruning (contd.) E.g.: X-a->X < X-b->X, X-a->X < X-a->Y, etc. Think of a simple lex ordering on strings of length 3 (nodeLabel.edgeLabel.nodeLabel). Visiting time (wrt the DFS) matters too: e.g., (0,1,?,?,?) < (0,2,?,?,?), (1,?,?,?,?), etc. So what? Well, we get a unique minimum DFS code for each graph. Revisit Fig. 2a: γ is its minimum DFS code. Where does this help? Pruning (revisit Fig. 1)! A small illustration follows.
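A tiny illustration (hypothetical codes, my own encoding): writing each code edge as a 5-tuple (i, j, nodeLabel_i, edgeLabel, nodeLabel_j), Python's built-in tuple/list comparison already realizes this lexicographic order, glossing over the refinement that edges are first compared by the order of the previous slides:

```python
# Two hypothetical DFS codes for illustration; each edge is
# (i, j, nodeLabel_i, edgeLabel, nodeLabel_j).
code1 = [(0, 1, 'X', 'a', 'X'), (1, 2, 'X', 'a', 'X')]
code2 = [(0, 1, 'X', 'a', 'X'), (1, 2, 'X', 'b', 'X')]
assert code1 < code2          # X-a->X precedes X-b->X on the second edge

def min_dfs_code(all_codes):
    """Canonical form of a graph: the lexicographically smallest DFS code."""
    return min(all_codes)

assert min_dfs_code([code2, code1]) == code1
```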

gSpan Algorithm Two components: graph-set projection and subgraph mining. Makes crucial use of the min. DFS code; involves the expensive step of computing the min. version of a given DFS code. Remarks: gSpan is one of the best-known algorithms for graph mining. Much more efficient than prior art (on synthetic and real data sets), but still quite expensive.

Graph Mining (Final Remarks) Considerable work since gSpan, e.g., constraints on patterns and pushing constraints into the “mining loop”.
X. Yan et al. Mining significant graph patterns by leap search. SIGMOD 2008.
M. Deshpande et al. Frequent substructure-based approaches for classifying chemical compounds. IEEE TKDE, 17:1036-1050, 2005.
X. Yan et al. Graph indexing: A frequent structure-based approach. SIGMOD, 335-346, 2004.
M. Hasan et al. ORIGAMI: Mining representative orthogonal graph patterns. In Proc. of ICDM, pages 153-162, 2007.

F. Pennerath and A. Napoli. Mining frequent most informative subgraphs. In the 5th Int. Workshop on Mining and Learning with Graphs, 2007.
X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. SIGKDD, 286-295, 2003.