Data Mining: Mining Frequent Patterns, Association and Correlations


1 Data Mining: Mining Frequent Patterns, Association and Correlations

2 Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

3 What Is Frequent Pattern Analysis?
- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products are often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- These methods assume categorical data; there is no comparably good algorithm for raw numeric data, which must first be discretized

4 Why Is Frequent Pattern Mining Important?
- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks:
  - Association, correlation, and causality analysis
  - Sequential and structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: cube-gradient
- Broad applications

5 Association Rules
- Example: customers who buy a PC tend also to buy antivirus software
  - PC ⇒ antivirus-sw [support = 2%, confidence = 60%]
  - Support = 2%: 2% of all transactions contain both PC and antivirus-sw
  - Confidence = 60%: 60% of the customers who buy a PC also buy antivirus-sw
- A rule must satisfy both minimum support and minimum confidence to be reported

6 Definitions
- Let I be the set of all items, e.g., I = {bread, milk, beer, chocolate, egg, diaper}
- Transaction: a set of items, e.g., T1 = {milk, egg}
- Itemset: a set of items
  - k-itemset: an itemset with k items
- Frequent itemset: an itemset whose frequency satisfies a threshold called the minimum support count

7 Transaction Data: A Set of Documents
- A text document data set; each document is treated as a "bag" of keywords
  - doc1: Student, Teach, School
  - doc2: Student, School
  - doc3: Teach, School, City, Game
  - doc4: Baseball, Basketball
  - doc5: Basketball, Player, Spectator
  - doc6: Baseball, Coach, Game, Team
  - doc7: Basketball, Team, City, Game

8 Association Rules
- Association rule: an implication A ⇒ B, where A ⊂ I, B ⊂ I, and A ∩ B = ∅
- A rule holds with some support s and confidence c:
  - Support(A ⇒ B): the percentage of transactions that contain A ∪ B, i.e., P(A ∪ B)
  - Confidence(A ⇒ B): the conditional probability that a transaction containing A also contains B, i.e., P(B|A) = freq(A ∪ B) / freq(A)
- Minimum support and minimum confidence are prespecified thresholds
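To make the definitions concrete, here is a minimal Python sketch of support and confidence over a toy transaction list; the transactions and helper names are illustrative, not from the slides:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    s = set(itemset)
    return sum(1 for t in transactions if s <= set(t)) / len(transactions)

def confidence(A, B, transactions):
    """P(B|A) = support(A u B) / support(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)

# Toy data echoing the PC / antivirus example above (hypothetical).
transactions = [
    {"PC", "antivirus-sw", "mouse"},
    {"PC", "printer"},
    {"PC", "antivirus-sw"},
    {"printer", "paper"},
]
print(support({"PC", "antivirus-sw"}, transactions))       # 0.5
print(confidence({"PC"}, {"antivirus-sw"}, transactions))  # 0.666...
```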

9 Basic Concepts: Frequent Patterns and Association Rules
- Itemset I = {i1, …, ik}
- Find all rules X ⇒ Y with minimum support and confidence:
  - support s: probability that a transaction contains X ∪ Y
  - confidence c: conditional probability that a transaction containing X also contains Y
- Example transactions:

  Transaction-id | Items bought
  1 | A, B, D
  2 | A, C, D
  3 | A, D, E
  4 | B, E, F
  5 | B, C, D, E, F

- Let sup_min = 50%, conf_min = 50%
  - Frequent itemsets: {A:3, B:3, D:4, E:3, AD:3}
  - Association rules: A ⇒ D (support 60%, confidence 100%); D ⇒ A (support 60%, confidence 75%)
- (Venn diagram in the original slide: customers who buy beer, customers who buy diapers, and their overlap)

10 Association Rule Mining
- Goal: find all rules that satisfy the user-specified minimum support and minimum confidence
- Association rule mining can be viewed as two steps:
  1. Find all frequent itemsets: itemsets that occur at least as frequently as the prespecified minimum support count min_sup
  2. Generate strong association rules: rules that satisfy minimum support and minimum confidence
- The first step is far more costly than the second

11 Closed Patterns
- Huge number of itemsets: if an itemset is frequent, then all of its subsets are frequent
- A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns
- Too huge to compute or store
- Solution: mine closed patterns instead

12 Closed Patterns
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
- Intuitively: report only the largest itemset for each support level
- Closed patterns are a lossless compression of the frequent patterns: they reduce the number of patterns and rules without losing support information
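As a sketch of what "closed" means operationally, the brute-force check below keeps only itemsets that have no equal-support proper superset. It assumes all frequent itemsets and their support counts are already available; the demo counts reuse slide 9's example (A:3, D:4, AD:3):

```python
def closed_itemsets(freq):
    """freq: dict mapping frozenset -> support count.
    X is closed iff no proper superset Y in freq has the same support."""
    return {X: sup for X, sup in freq.items()
            if not any(X < Y and s == sup for Y, s in freq.items())}

freq = {
    frozenset("A"): 3,
    frozenset("D"): 4,
    frozenset("AD"): 3,   # same support as {A}, so {A} is not closed
}
print(closed_itemsets(freq))  # {'D'}: 4 and {'A', 'D'}: 3 remain; {'A'} is absorbed
```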

13 Closed Patterns: Exercise
- DB = { {a1, …, a100}, {a1, …, a50} }, min_sup = 1
- What is the set of closed itemsets?
  - {a1, …, a100}: support 1
  - {a1, …, a50}: support 2
- What is the set of all frequent patterns? All 2^100 − 1 nonempty subsets of {a1, …, a100}!

14 Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

15 Scalable Methods for Mining Frequent Patterns
- The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}, since every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: several approaches
  - Apriori (Agrawal & Srikant @VLDB'94)
  - Frequent pattern growth (FP-growth — Han, Pei & Yin @SIGMOD'00)
- Any correct algorithm finds the same set of rules; the algorithms differ in computational efficiency and memory requirements

16 Apriori: A Candidate Generation-and-Test Approach
- Apriori pruning principle: if any itemset is infrequent, none of its supersets should be generated or tested (Agrawal & Srikant @VLDB'94; Mannila, et al. @KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from the length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

17 The Apriori Algorithm — An Example (sup_min = 2)

Database TDB:
  Tid | Items
  10 | A, C, D
  20 | B, C, E
  30 | A, B, C, E
  40 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
L1 (prune {D}): {A}:2, {B}:3, {C}:3, {E}:3

C2 (from L1): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (from L2): {B,C,E}
3rd scan → L3: {B,C,E}:2

18 The Apriori Algorithm
Pseudo-code (Ck: candidate itemsets of size k; Lk: frequent itemsets of size k):

  L1 = {frequent 1-itemsets};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
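For concreteness, here is a compact, unoptimized Python rendering of this loop, run on the TDB from the previous slide. It is a sketch: the hash-tree counting and other speed-ups from the literature are omitted.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return every frequent itemset with its support count."""
    transactions = [frozenset(t) for t in transactions]

    def count(c):
        return sum(1 for t in transactions if c <= t)

    items = {i for t in transactions for i in t}
    L = {frozenset([i]) for i in items if count(frozenset([i])) >= min_sup}
    freq, k = {c: count(c) for c in L}, 1
    while L:
        # self-join: merge pairs of frequent k-itemsets into (k+1)-candidates
        C = {a | b for a in L for b in L if len(a | b) == k + 1}
        # prune: every k-subset of a surviving candidate must be in L
        C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
        L = {c for c in C if count(c) >= min_sup}
        freq.update({c: count(c) for c in L})
        k += 1
    return freq

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)   # ends with ['B', 'C', 'E'] 2
```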

19 Important Details of Apriori
- How to generate candidates?
  - Step 1: self-join Lk with itself
  - Step 2: prune candidates that have an infrequent k-subset
- How to count the supports of candidates?
- Example of candidate generation (see the sketch below):
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-join L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed because ade is not in L3
  - Result: C4 = {abcd}
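The self-contained sketch below reproduces exactly this candidate-generation step, representing itemsets as sorted tuples so the self-join can match on the first k−1 items (the function name is ours, not the book's):

```python
from itertools import combinations

def apriori_gen(Lk):
    """Self-join Lk with itself, then prune candidates with an infrequent k-subset."""
    k = len(next(iter(Lk)))
    ordered, members = sorted(Lk), set(Lk)
    candidates = set()
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            if ordered[i][:k - 1] == ordered[j][:k - 1]:            # join step
                c = ordered[i] + (ordered[j][-1],)
                if all(s in members for s in combinations(c, k)):   # prune step
                    candidates.add(c)
    return candidates

L3 = {("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"),
      ("a", "c", "e"), ("b", "c", "d")}
print(apriori_gen(L3))   # {('a', 'b', 'c', 'd')} -- acde pruned: ade not in L3
```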

20 Step 2: Generating Rules from Frequent Itemsets
- Frequent itemsets ⇒ association rules: one more step is needed to generate the rules
- For each frequent itemset X:
  - For each proper nonempty subset A of X, let B = X − A
  - A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, where
    - support(A ⇒ B) = support(A ∪ B) = support(X)
    - confidence(A ⇒ B) = support(A ∪ B) / support(A)
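A sketch of this rule-generation step for a single frequent itemset X, assuming the mining phase already produced a `freq` map from itemsets to supports (names are illustrative); the demo numbers match the worked example on the next slide:

```python
from itertools import combinations

def rules_from(X, freq, min_conf):
    """All rules A => X-A over one frequent itemset X with confidence >= min_conf."""
    rules = []
    for r in range(1, len(X)):                       # proper nonempty subsets A
        for A in map(frozenset, combinations(X, r)):
            conf = freq[X] / freq[A]                 # support(X) / support(A)
            if conf >= min_conf:
                rules.append((set(A), set(X - A), freq[X], conf))
    return rules

freq = {frozenset(s): v for s, v in [
    ((2, 3, 4), .50), ((2, 3), .50), ((2, 4), .50), ((3, 4), .75),
    ((2,), .75), ((3,), .75), ((4,), .75)]}
for A, B, s, c in rules_from(frozenset({2, 3, 4}), freq, 0.8):
    print(f"{A} => {B}  support={s:.0%}  confidence={c:.0%}")
# prints (in some order):
# {2, 3} => {4}  support=50%  confidence=100%
# {2, 4} => {3}  support=50%  confidence=100%
```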

21 Generating Rules: An Example
- Suppose {2,3,4} is frequent, with support = 50%
- Its proper nonempty subsets {2,3}, {2,4}, {3,4}, {2}, {3}, {4} have supports 50%, 50%, 75%, 75%, 75%, 75% respectively
- These generate the following association rules:
  - 2,3 ⇒ 4, confidence = 100%
  - 2,4 ⇒ 3, confidence = 100%
  - 3,4 ⇒ 2, confidence = 67%
  - 2 ⇒ 3,4, confidence = 67%
  - 3 ⇒ 2,4, confidence = 67%
  - 4 ⇒ 2,3, confidence = 67%
- All rules have support = 50%, since support(A ⇒ B) = support(A ∪ B) = support({2,3,4})

22 Generating Rules: Summary
- To obtain a rule A ⇒ B, we need support(A ∪ B) and support(A)
- All the information required to compute confidence was already recorded during itemset generation; there is no need to scan the data T again
- This step is therefore much less time-consuming than frequent itemset generation

23 Bottleneck of the Apriori Algorithm
- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates huge numbers of candidates
  - To find the frequent itemset i1 i2 … i100, the number of scans is 100
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?

24 Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - If "d" is a local frequent item in DB|abc, then abcd is a frequent pattern

25 Construct FP-tree from a Transaction Database (min_support = 3)

  TID | Items bought | (Ordered) frequent items
  100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
  200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
  300 | {b, f, h, j, o, w} | {f, b}
  400 | {b, c, k, s, p} | {c, b, p}
  500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

Steps:
1. Scan the DB once; find the frequent 1-itemsets (single-item patterns)
2. Sort frequent items in descending frequency order to get the f-list: f-c-a-b-m-p (f:4, c:4, a:3, b:3, m:3, p:3)
3. Scan the DB again and construct the FP-tree
(Figure: the tree after inserting TID 100 — a single path {} → f:1 → c:1 → a:1 → m:1 → p:1, with a header table linking each frequent item to its nodes)

26 Construct FP-tree from a Transaction Database (continued)
(Figure: the tree after inserting TID 200 — the shared prefix now reads {} → f:2 → c:2 → a:2, with two children under a:2: m:1 → p:1 from TID 100 and b:1 → m:1 from TID 200)

27 Construct FP-tree from a Transaction Database (continued)
(Figure: the complete FP-tree after all five transactions. Root children: f:4 and c:1. Under f:4: c:3 → a:3, which branches into m:2 → p:2 and b:1 → m:1; plus a separate b:1 under f:4. Under c:1: b:1 → p:1. Header table: f:4, c:4, a:3, b:3, m:3, p:3; F-list = f-c-a-b-m-p)

28 Construct FP-tree from a Transaction Database (final result; same tree as slide 27)
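A small sketch of steps 1-2 on this table: count item frequencies, keep items meeting min_support, and rewrite each transaction in f-list order. Ties between equally frequent items may break differently from the slide's f-c-a-b-m-p; any fixed tie-break is equally valid.

```python
from collections import Counter

def order_transactions(transactions, min_sup):
    """Steps 1-2 of FP-tree construction: f-list plus reordered transactions."""
    counts = Counter(i for t in transactions for i in t)
    flist = [i for i, c in counts.most_common() if c >= min_sup]
    rank = {i: r for r, i in enumerate(flist)}
    ordered = [sorted((i for i in t if i in rank), key=rank.get)
               for t in transactions]
    return flist, ordered

db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
flist, ordered = order_transactions(db, 3)
print(flist)       # ['f', 'c', 'a', 'm', 'p', 'b'] -- same counts, different tie-break
print(ordered[0])  # ['f', 'c', 'a', 'm', 'p'] -- matches the slide for TID 100
```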

29 Find Patterns Having p from p's Conditional Pattern Base
- Start at the frequent-item header table of the FP-tree
- Traverse the FP-tree by following the node-links of each frequent item p
- Accumulate all transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
  item | conditional pattern base
  c | f:3
  a | fc:3
  b | fca:1, f:1, c:1
  m | fca:2, fcab:1
  p | fcam:2, cb:1

30 Example: Conditional Patterns Having p
- Start at the bottom of the header table and follow p's node-links
- The paths reaching p are (f:4, c:3, a:3, m:2, p:2) and (c:1, b:1, p:1)
- Adjust each node's count to p's count on that path: (f:2, c:2, a:2, m:2, p:2) and (c:1, b:1, p:1)
- Remove p to get the prefix paths: (f:2, c:2, a:2, m:2) and (c:1, b:1)

31 From Conditional Pattern Bases to Conditional FP-trees (min_support = 3)
- For each pattern base:
  - Accumulate the count for each item in the base
  - Construct the FP-tree over the frequent items of the pattern base
- Example: m's conditional pattern base is fca:2, fcab:1
  - b (count 1) is infrequent, so the m-conditional FP-tree is the single path {} → f:3 → c:3 → a:3
  - All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
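To show the whole recursion end to end, here is a compact FP-growth sketch. For brevity it mines conditional pattern bases directly as (prefix-path, count) lists instead of materializing each conditional FP-tree as a linked node structure; on the slide's data it yields the same set of patterns.

```python
from collections import Counter

def fp_growth(transactions, min_sup):
    """Return {frozenset: support} for all frequent patterns."""
    counts = Counter(i for t in transactions for i in t)
    rank = {i: r for r, (i, c) in enumerate(counts.most_common()) if c >= min_sup}
    # Order each transaction by the f-list and drop infrequent items.
    base = [(sorted((i for i in t if i in rank), key=rank.get), 1)
            for t in transactions]
    patterns = {}
    _mine(base, min_sup, frozenset(), patterns)
    return patterns

def _mine(base, min_sup, suffix, patterns):
    counts = Counter()
    for path, n in base:
        for item in path:
            counts[item] += n
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        pattern = suffix | {item}
        patterns[pattern] = sup
        # Conditional pattern base for `item`: the prefix of each path
        # that contains it, carrying that path's count.
        cond = [(path[:path.index(item)], n)
                for path, n in base if item in path and path.index(item) > 0]
        _mine(cond, min_sup, pattern, patterns)

db = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
pats = fp_growth(db, 3)
print(sorted("".join(sorted(p)) for p in pats if "m" in p))
# ['acfm', 'acm', 'afm', 'am', 'cfm', 'cm', 'fm', 'm'] -- the eight m-patterns above
```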

32 Another Example (min_sup = 2)

  Transaction | Itemset | Sorted itemset
  T1 | a,b,e | b,a,e
  T2 | b,d | b,d
  T3 | b,c | b,c
  T4 | a,b,d | b,a,d
  T5 | a,c | a,c
  T6 | b,c | b,c
  T7 | a,c | a,c
  T8 | a,b,c,e | b,a,c,e
  T9 | a,b,c | b,a,c

  Frequent items: b:7, a:6, c:6, d:2, e:2

33 Example (continued)

  Item | Conditional pattern base | Conditional FP-tree | Frequent patterns
  e | {b,a:1}, {b,a,c:1} | (b:2 → a:2) | {b,e:2}, {a,e:2}, {a,b,e:2}
  d | {b,a:1}, {b:1} | (b:2) | {b,d:2}
  c | {b,a:2}, {b:2}, {a:2} | (b:4 → a:2), (a:2) | {b,c:4}, {a,c:4}, {a,b,c:2}
  a | {b:4} | (b:4) | {b,a:4}

34 Benefits of the FP-tree Structure
- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant information — infrequent items are gone
  - Items are in descending frequency order: the more frequently an item occurs, the more likely its nodes are shared
  - The tree is never larger than the original database (not counting node-links and count fields)

35 Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list (f-c-a-b-m-p):
  - patterns containing p
  - patterns having m but no p
  - …
  - patterns having c but none of a, b, m, p
  - the pattern f
- This partition is complete and non-redundant

36 Scaling FP-growth by DB Projection
- What if the FP-tree cannot fit in memory? DB projection
- First partition the database into a set of projected DBs
- Then construct and mine an FP-tree for each projected DB
- Parallel projection vs. partition projection techniques; parallel projection is space-costly

37 Why Is FP-growth the Winner?
- Divide-and-conquer:
  - Decomposes both the mining task and the DB according to the frequent patterns obtained so far
  - Leads to focused search of smaller databases
- Other factors:
  - No candidate generation, no candidate test
  - Compressed database: the FP-tree structure
  - No repeated scans of the entire database
  - Basic operations are counting local frequent items and building sub-FP-trees: no pattern search and matching

38 Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

39 Mining Various Kinds of Association Rules
- Mining multilevel associations
- Mining multidimensional associations
- Mining quantitative associations
- Mining interesting correlation patterns

40 Mining Multiple-Level Association Rules
- Items often form hierarchies (e.g., Logitech wireless laser mouse M505; Norton Antivirus 2015)
- Items at lower levels are expected to have lower support, so it is difficult to find strong associations with a single threshold
- Flexible support settings: use multiple min_sup values
  - Example hierarchy: Milk [support = 10%], with children 2% Milk [support = 6%] and Skim Milk [support = 4%]
  - Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5% → Skim Milk (4%) is lost
  - Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3% → all three are frequent
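A toy sketch of the reduced-support scheme, using the numbers above (the level assignments and dictionary layout are illustrative):

```python
level_min_sup = {1: 0.05, 2: 0.03}   # reduced support: 5% at level 1, 3% at level 2
support = {("Milk", 1): 0.10, ("2% Milk", 2): 0.06, ("Skim Milk", 2): 0.04}

frequent = [item for (item, level), s in support.items()
            if s >= level_min_sup[level]]
print(frequent)  # ['Milk', '2% Milk', 'Skim Milk']; uniform 5% would drop Skim Milk
```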

41 Problem with Uniform Support
- If item frequencies vary a great deal, we encounter two problems:
  - If min_sup is set too high, rules that involve rare items will not be found
  - To find rules involving both frequent and rare items, min_sup must be set very low; this causes combinatorial explosion, because the frequent items then become associated with one another in all possible ways

42 Mining Multi-Dimensional Associations
- Single-dimensional rules: buys(X, "milk") ⇒ buys(X, "bread")
- Multi-dimensional rules: ≥ 2 dimensions or predicates
  - Inter-dimension association rules (no repeated predicates): age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  - Hybrid-dimension association rules (repeated predicates): age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
- Categorical attributes: finite number of possible values, no ordering among values (occupation, brand, color)
- Quantitative attributes: numeric, with an implicit ordering among values; handled by discretization, clustering, and gradient approaches

43 Mining Quantitative Associations
Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on the data distribution (quantitative rules); intervals are determined dynamically
3. Clustering: distance-based associations — one-dimensional clustering, then association
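As a sketch of approach 1 (static discretization), the helper below maps a numeric value to a predefined interval label before mining; the age bands are illustrative assumptions, not from the slides:

```python
import bisect

def discretize(value, upper_bounds, labels):
    """Map a numeric value to an interval label.
    upper_bounds are inclusive upper bounds; len(labels) == len(upper_bounds) + 1."""
    return labels[bisect.bisect_left(upper_bounds, value)]

age_bounds = [18, 25, 40, 65]                        # hypothetical band edges
age_labels = ["<=18", "19-25", "26-40", "41-65", "65+"]
print(discretize(21, age_bounds, age_labels))  # '19-25'
print(discretize(70, age_bounds, age_labels))  # '65+'
```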

44 Mining Other Interesting Patterns
- Flexible support constraints
  - Some items (e.g., diamonds) may occur rarely but are valuable
  - Customized min_sup specification and application
- Top-k closed frequent patterns
  - Useful when min_sup is hard to specify in advance

45 Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

46 Interestingness Measure: Lift
- Strong rules are not necessarily interesting: support and confidence only estimate, and do not measure, the real strength of a correlation
- Example: play basketball ⇒ eat cereal [support 40%, confidence 66.7%] is misleading, because the overall share of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [support 20%, confidence 33.3%] is more accurate, although it has lower support and confidence

  | Basketball | Not basketball | Sum (row)
  Cereal | 2000 | 1750 | 3750
  Not cereal | 1000 | 250 | 1250
  Sum (col.) | 3000 | 2000 | 5000

47 Interestingness Measure: Correlations (Lift)
- Lift measures how dependent or correlated two events are: lift(A, B) = P(A ∪ B) / (P(A) × P(B))
- If lift > 1, A and B are positively correlated; if lift < 1, negatively correlated; if lift = 1, independent
- For the table on the previous slide:
  - lift(basketball, cereal) = (2000/5000) / ((3000/5000) × (3750/5000)) ≈ 0.89 < 1 (negatively correlated)
  - lift(basketball, not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) ≈ 1.33 > 1 (positively correlated)
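A quick check of these lift values from the raw contingency counts (the helper name is ours):

```python
def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B)), from raw contingency counts."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

print(round(lift(2000, 3000, 3750, 5000), 2))  # 0.89 -> basketball, cereal: negative
print(round(lift(1000, 3000, 1250, 5000), 2))  # 1.33 -> basketball, not cereal: positive
```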

48 Mining Frequent Patterns, Association and Correlations
- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Summary

49 Summary
- Frequent pattern mining: an important task in data mining
- Support and confidence
- Scalable frequent pattern mining methods:
  - Apriori (candidate generation and test)
  - FP-growth (projection-based)
- Mining a variety of rules and interesting patterns
- Correlation analysis: lift

