Find Patterns Having P From P-conditional Database


Find Patterns Having P From P-conditional Database

- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node link of each frequent item p
- Accumulate all the transformed prefix paths of item p to form p's conditional pattern base

[Figure: FP-tree for the running example. Header table: f:4, c:4, a:3, b:3, m:3, p:3. Paths from the root {}: f:4 -> c:3 -> a:3 -> m:2 -> p:2, f:4 -> c:3 -> a:3 -> b:1 -> m:1, f:4 -> b:1, and c:1 -> b:1 -> p:1.]

Conditional pattern bases:

item   conditional pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
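The traversal above is mechanical enough to sketch in code. Below is a minimal Python sketch; the FPNode class and the header-table layout (item -> [support, first node of the node-link chain]) are assumptions for illustration, not code from the slides:

```python
class FPNode:
    """One node of an FP-tree."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}      # item -> child FPNode
        self.node_link = None   # next tree node carrying the same item

def conditional_pattern_base(header, item):
    """header maps item -> [support, first node of the item's node-link chain]."""
    base = []
    node = header[item][1]
    while node is not None:
        # Transformed prefix path: the node's ancestors up to (but not
        # including) the root, each carried with this node's count.
        path, up = [], node.parent
        while up is not None and up.item is not None:
            path.append(up.item)
            up = up.parent
        if path:
            base.append((path[::-1], node.count))
        node = node.node_link
    return base
```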

From Conditional Pattern-bases to Conditional FP-trees

For each pattern base:
- Accumulate the count for each item in the base
- Construct the FP-tree for the frequent items of the pattern base

Example: m's conditional pattern base is fca:2, fcab:1. Accumulating counts gives f:3, c:3, a:3, b:1; with min_sup = 3, b is infrequent within the base and is dropped, so the m-conditional FP-tree is the single path {} -> f:3 -> c:3 -> a:3. All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam.

[Figure: the global FP-tree with its header table (f:4, c:4, a:3, b:3, m:3, p:3) on the left; the m-conditional FP-tree {} -> f:3 -> c:3 -> a:3 on the right.]
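A hedged sketch of this construction, reusing the FPNode class from the previous sketch (names and the returned header shape are assumptions):

```python
class FPNode:  # as in the previous sketch
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children, self.node_link = 0, {}, None

def build_conditional_fp_tree(pattern_base, min_sup):
    # 1. Accumulate the count of each item across the whole base.
    counts = {}
    for path, cnt in pattern_base:
        for it in path:
            counts[it] = counts.get(it, 0) + cnt
    # 2. Keep only locally frequent items and re-insert the filtered paths,
    #    threading node links so the new tree can itself be mined.
    header = {it: [c, None] for it, c in counts.items() if c >= min_sup}
    root = FPNode(None, None)
    for path, cnt in pattern_base:
        node = root
        for it in (it for it in path if it in header):
            child = node.children.get(it)
            if child is None:
                child = FPNode(it, node)
                node.children[it] = child
                child.node_link, header[it][1] = header[it][1], child
            child.count += cnt
            node = child
    return root, header

# m's pattern base (fca:2, fcab:1) with min_sup = 3: b's count is 1, so it is
# dropped and the tree collapses to the single path f:3 -> c:3 -> a:3.
tree, hdr = build_conditional_fp_tree([(list("fca"), 2), (list("fcab"), 1)], 3)
```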

Recursion: Mining Each Conditional FP-tree

Mining the m-conditional FP-tree ({} -> f:3 -> c:3 -> a:3) recurses in the same way:
- Conditional pattern base of "am": (fc:3); am-conditional FP-tree: {} -> f:3 -> c:3
- Conditional pattern base of "cm": (f:3); cm-conditional FP-tree: {} -> f:3
- Conditional pattern base of "cam": (f:3); cam-conditional FP-tree: {} -> f:3

Mining Frequent Patterns With FP-trees

Idea: frequent-pattern growth
- Recursively grow frequent patterns by pattern and database partition

Method:
- For each frequent item, construct its conditional pattern base, and then its conditional FP-tree
- Repeat the process on each newly created conditional FP-tree
- Until the resulting FP-tree is empty
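The whole recursion is compact enough to sketch end to end. A minimal driver, assuming the conditional_pattern_base and build_conditional_fp_tree sketches above; note that the tree root itself is never needed here, since mining works entirely through the header table's node links:

```python
def fp_growth(header, suffix, min_sup, results):
    # Visit items least-frequent first; each one extends the current suffix
    # by pattern growth, then mining recurses on its conditional FP-tree.
    for item in sorted(header, key=lambda it: header[it][0]):
        support = header[item][0]
        pattern = frozenset(suffix) | {item}
        results[pattern] = support
        base = conditional_pattern_base(header, item)
        _, cond_header = build_conditional_fp_tree(base, min_sup)
        if cond_header:              # empty conditional tree: recursion ends
            fp_growth(cond_header, pattern, min_sup, results)

# Usage: results = {}; fp_growth(global_header, set(), 3, results)
# where global_header comes from building the FP-tree over the full DB.
```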

Scaling FP-growth by DB Projection

- What if the FP-tree cannot fit in memory? Use DB projection:
  - First partition the database into a set of projected DBs
  - Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection:
  - Parallel projection is space-costly

Partition-based Projection

[Figure: the transaction DB {fcamp, fcabm, fb, cbp, ...} is split into projected DBs: p-proj DB {fcam, cb, ...}, m-proj DB {fcab, fca, ...}, b-proj DB {f, ...}, a-proj DB {fc, ...}, c-proj DB, f-proj DB, and then recursively am-proj DB, cm-proj DB, ...]

- Parallel projection needs a lot of disk space
- Partition projection saves it
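A minimal sketch of one round of partition projection (the full scheme passes each mined transaction on to the next projected DB; function and variable names are illustrative):

```python
def partition_project(transactions, f_list):
    # Order items by the f-list, then write each transaction to the projected
    # DB of its LAST frequent item only; unlike parallel projection, each
    # transaction is stored once, so total disk space stays bounded.
    rank = {item: i for i, item in enumerate(f_list)}
    proj = {item: [] for item in f_list}
    for t in transactions:
        items = sorted((it for it in t if it in rank), key=rank.get)
        if items:
            proj[items[-1]].append(items[:-1])
    return proj

# partition_project(["fcamp", "fcabm", "fb", "cbp"], list("fcabmp")) yields
# p-proj DB = [fcam, cb], m-proj DB = [fcab], b-proj DB = [f], matching the
# figure; mining p-proj DB then feeds its transactions to the remaining DBs.
```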

FP-Growth vs. Apriori: Scalability with the Support Threshold

[Figure: run time of FP-growth vs. Apriori as the minimum support threshold decreases, on data set T25I20D10K]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: run time of FP-growth vs. Tree-Projection as the minimum support threshold decreases, on data set T25I20D100K]

Why Is FP-Growth the Winner?

- Divide-and-conquer:
  - Decompose both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scan of the entire database
  - Basic operations are counting local frequent items and building sub-FP-trees; no pattern search and matching

Implications of the Methodology

- Mining closed frequent itemsets and max-patterns: CLOSET (DMKD'00)
- Mining sequential patterns: FreeSpan (KDD'00), PrefixSpan (ICDE'01)
- Constraint-based mining of frequent patterns: convertible constraints (KDD'00, ICDE'01)
- Computing iceberg data cubes with complex measures: H-tree and H-cubing algorithm (SIGMOD'01)

MaxMiner: Mining Max-patterns

- 1st scan: find the frequent items A, B, C, D, E
- 2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, CDE; DE (the longest itemset in each group, e.g. ABCDE and BCDE, is a potential max-pattern)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

Tid   Items
10    A, B, C, D, E
20    B, C, D, E
30    A, C, D, F
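The pruning in the second bullet is MaxMiner's "lookahead" idea. A simplified, hedged sketch of that one test (support is assumed to be a callable over the transaction DB; all names are illustrative, not Bayardo's code):

```python
def lookahead_prune(head, tail, support, min_sup, max_patterns):
    # If head + tail is frequent, every itemset in this candidate group is
    # frequent but not maximal, so the whole group is pruned in one test.
    full = frozenset(head) | frozenset(tail)
    if support(full) >= min_sup:
        max_patterns.add(full)  # like BCDE above: BCD, BDE, CDE need no check
        return True
    return False
```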

Further Improvements of Mining Methods

- AFOPT: a "push-right" method for mining condensed frequent-pattern (CFP) trees
- Carpenter: mines data sets with few rows but very many columns, constructing a row-enumeration tree for efficient mining

Visualization of Association Rules: Plane Graph

Visualization of Association Rules: Rule Graph

Visualization of Association Rules (SGI/MineSet 3.0)

Mining Various Kinds of Association Rules

- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns

Mining Multiple-Level Association Rules

- Items often form hierarchies
- Flexible support settings: items at lower levels are expected to have lower support
- Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

[Figure: uniform support vs. reduced support over the hierarchy Milk [support = 10%] with children 2% Milk [support = 6%] and Skim Milk [support = 4%]. Uniform support: min_sup = 5% at both levels, so Skim Milk is pruned. Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%, so all three items are frequent.]
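A small sketch of the reduced-support check; the thresholds and supports mirror the milk example above, while the data layout itself is an assumption for illustration:

```python
# Each hierarchy level gets its own minimum support (reduced support).
level_min_sup = {1: 0.05, 2: 0.03}
items = [("Milk", 1, 0.10), ("2% Milk", 2, 0.06), ("Skim Milk", 2, 0.04)]

frequent = [name for name, level, sup in items if sup >= level_min_sup[level]]
print(frequent)  # all three pass; under uniform min_sup = 5%, Skim Milk fails
```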

Multi-level Association: Redundancy Filtering

- Some rules may be redundant due to "ancestor" relationships between items
- Example:
  - milk ⇒ wheat bread [support = 8%, confidence = 70%]
  - 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
- The first rule is an ancestor of the second rule
- A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
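To make the expectation concrete: if, say, 2% milk accounts for a quarter of all milk purchases, the expected support of the second rule is 8% × 1/4 = 2%. Its actual support is exactly 2%, and its confidence (72%) is close to the ancestor's (70%), so the more specific rule adds no new information and can be filtered out.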

Mining Multi-Dimensional Association

- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values; data cube approach
- Quantitative attributes: numeric, implicit ordering among values; discretization, clustering, and gradient approaches

Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
- Static discretization based on predefined concept hierarchies (data cube methods)
- Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
- Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97); one-dimensional clustering, then association
- Deviation (e.g., Aumann & Lindell @KDD'99): sex = female ⇒ wage: mean = $7/hr (overall mean = $9)

Static Discretization of Quantitative Attributes

- Attributes are discretized prior to mining using a concept hierarchy: numeric values are replaced by ranges
- In a relational database, finding all frequent k-predicate sets requires k or k+1 table scans
- A data cube is well suited for mining: the cells of an n-dimensional cuboid correspond to the predicate sets, so mining from data cubes can be much faster

[Figure: lattice of cuboids over (age, income, buys): (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]
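A tiny sketch of what static discretization looks like in practice; the bin edges and labels below are assumptions, not values from the slides:

```python
AGE_RANGES = [(0, 18), (19, 25), (26, 40), (41, 65), (66, 120)]  # assumed hierarchy

def discretize_age(age):
    # Replace a numeric value by the label of its predefined range.
    for lo, hi in AGE_RANGES:
        if lo <= age <= hi:
            return f"age:{lo}-{hi}"
    raise ValueError(f"age {age} outside the hierarchy")

# A tuple (age=23, income=35000, buys="coke") becomes the predicate set
# {"age:19-25", "income:30-50K", "buys:coke"}, which ordinary frequent-itemset
# mining (or a data cube) can then process.
print(discretize_age(23))  # -> "age:19-25"
```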

Quantitative Association Rules

- Proposed by Lent, Swami and Widom @ICDE'97
- Numeric attributes are dynamically discretized so that the confidence or compactness of the mined rules is maximized
- 2-D quantitative association rules: A_quan1 ∧ A_quan2 ⇒ A_cat
- Cluster adjacent association rules to form general rules using a 2-D grid
- Example: age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Mining Other Interesting Patterns

- Flexible support constraints
  - Some items (e.g., diamond) may occur rarely but are valuable
  - Customized min_sup specification and application
- Top-k closed frequent patterns
  - Hard to specify min_sup, but top-k with a minimum pattern length is more desirable
  - Dynamically raise min_sup during FP-tree construction and mining, and select the most promising paths to mine

Interestingness Measure: Correlations (Lift)

- play basketball ⇒ eat cereal [40%, 66.7%] is misleading: the overall share of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
- Measure of dependent/correlated events: lift

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000
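Plugging the table into the usual definition lift(A, B) = P(A ∧ B) / (P(A) · P(B)) makes the reversal explicit; a quick check in Python:

```python
def lift(n_ab, n_a, n_b, n):
    # lift(A, B) = P(A and B) / (P(A) * P(B)), from raw contingency counts
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(lift(2000, 3000, 3750, 5000))  # 0.89 < 1: basketball and cereal are
                                     # negatively correlated
print(lift(1000, 3000, 1250, 5000))  # 1.33 > 1: basketball and "not cereal"
                                     # are positively correlated
```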

Are lift and 2 Good Measures of Correlation? “Buy walnuts  buy milk [1%, 80%]” is misleading if 85% of customers buy milk Support and confidence are not good to represent correlations So many interestingness measures? (Tan, Kumar, Sritastava @KDD’02) Milk No Milk Sum (row) Coffee m, c ~m, c c No Coffee m, ~c ~m, ~c ~c Sum(col.) m ~m  DB m, c ~m, c m~c ~m~c lift all-conf coh 2 A1 1000 100 10,000 9.26 0.91 0.83 9055 A2 100,000 8.44 0.09 0.05 670 A3 10000 9.18 8172 A4 1 0.5 0.33